Analysis of Thermally Induced Loss in
Fiber-Optic Ribbons 


By G. S. BROCKWAY* and M. R. SANTANA 
(Manuscript received October 1, 1982) 


In this paper, added loss during temperature cycling in a given 
ribboned fiber is shown to be caused by thermally induced axial 
compressive strain imparted to the fiber. A microbending-sensitivity 
parameter δ is introduced that reduces all loss-strain curves
corresponding to different fibers to one characteristic master curve.
Thermoviscoelasticity theory is used to calculate the time- and tem- 
perature-dependent compressive strain imparted to a ribboned fiber 
during a standard environmental cycle. Combining these analytical 
results with environmental data, the functional relationship between 
fiber-compressive strain and the added loss for a fiber of any given 
δ in an Adhesive-Sandwich Ribbon (ASR) with Urethane-Acrylate
(UA) coated fibers has been determined. Using this analysis, the
added loss for a UA ASR can now be predicted for any environmental
cycle. The critical material properties that dominate the environmental
performance of ASRs are the tape shrinkback at elevated temperatures
and the product αEA of the coefficient α of thermal expansion,
the time- and temperature-dependent relaxation modulus E, and the
area A of the coating.


I. INTRODUCTION 


Unless special precautions are taken, fiber-optic cables installed in 
the outside plant could experience temperatures ranging from -45°F
to +190°F. Thus, optical transmission loss resulting from this thermal
history directly impacts the coating and ribbon choices for a particular 
system design. An understanding of the relationships among the ther- 
mally induced strains on the “ribboned” glass fibers, the resulting 
added transmission loss, and fiber parameters is crucial to properly 
evaluating candidate fiber-coating materials, ribbon structures, and/or 
ribbon-matrix materials. Moreover, it is desirable to be able to predict 
long-term behavior from short-term testing, thereby simplifying the 
environmental testing procedure. 

The added loss in environmental testing is generally thought to be 
associated with the microbending of the axis of the fiber. Moreover, it 
has been shown by other investigators that when intimate contact is 
forced between a fiber and a microscopically rough surface, losses due 
to microbending can be substantial and are a function of the fiber 
geometric and optical parameters as well as the elastic modulus of the 
rough surface.’” Therefore, thermally induced strains on the ribbon 
structure can indirectly result in added loss by increasing the contact 
forces between fibers. In addition, the sensitivity of a particular fiber 
to microbending loss is said to be associated with irregularities at the 
core-cladding interface, which may be due to core-diameter variations 
and/or refractive index variations.° 

In this paper, a thermoviscoelastic analysis is used to compute the 
axial and transverse strains that are imparted to a fiber-optic ribbon 
when it is subjected to any given thermal cycle. This analysis shows 
that for the Bell System Adhesive-Sandwich-Ribbon‘ (ASR) construc- 
tion with Urethane-Acrylate (UA) coated fibers, the transverse strain 
due to lowering the temperature results, contrary to one’s expectation, 
in a reduction in the contact force between fibers. On the other hand, 
the resulting axial compressive strain evidently either increases this 
contact force or induces fiber buckling at the critical spatial wave- 
length, and thus creates added optical loss. Moreover, for a given 
ribboned fiber we establish herein the existence of a relationship 
between the environmental added loss and any measure of microbend- 
ing sensitivity of the fiber. This equivalence enables us to predict the 
environmental performance of any ribboned fiber over a wide temper- 
ature-time span from data gathered over a relatively short time. In 
particular, the loss data collected from one excursion to —45°F can be 
combined with thermoviscoelastic data and analysis to predict the loss 
of any ribbon configuration with any choice of constituent materials, 
provided the tested fibers have a suitable range of microbending 
sensitivity. 


II. ENVIRONMENTAL TEST PROCEDURE


The intent of environmental testing is to determine the effect of 
thermal exposure on the optical performance of fiber-optic ribbons. All 
Table I—Environmental test cycle

Cycle      Exposure           Exposure
Number     Temperature, T     Time, t
           (°F)               (days)

I             75                 2
             -45                 2
             -15                 2
              15                 2
              75                 2

             170                30

II            75                 2
             -45                 2
             -15                 2
              15                 2
              75                 2

             190                34

III           75                 2
             -45                 2
             -15                 2
              15                 2
              75                 2


the ribbons tested were placed in a 23-inch-diameter cardboard cylin- 
drical container in a stem-pack fashion. The containers were then 
placed in a “walk-in” environmental chamber with both ends available 
for measuring purposes. Input and output array connectors’ were 
fabricated to measure loss by the reference-fiber technique. Namely, 
the 0.63-μm loss of a fiber is obtained by taking the ratio of its output
power to the average output power of ten reference fibers that are 
short enough to have negligible loss. 
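
The reference-fiber measurement can be sketched in a few lines. This is a minimal illustration only: it assumes the power ratio is converted to decibels in the usual way and normalized by the test-fiber length, and the function and parameter names are hypothetical rather than taken from the original test setup.

```python
import math

def loss_db_per_km(p_out, p_refs, length_km):
    """Loss of a test fiber relative to the mean output power of the reference fibers.

    p_out and p_refs share the same (arbitrary) power units; length_km is the
    test-fiber length. The dB conversion and the length normalization are
    assumptions made for this sketch.
    """
    p_ref = sum(p_refs) / len(p_refs)  # average reference-fiber output power
    return -10.0 * math.log10(p_out / p_ref) / length_km
```

The added loss at a given exposure would then be the difference between this value and the one measured at the start of the test.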

Table I is a summary of the environmental testing cycle used for 
this investigation. The loss of the ribboned fibers* was obtained at the 
end of each exposure. The change in loss for a typical ribboned fiber 
is plotted in Fig. 1 for each temperature in Cycles I, II, and III. The 
loss of the ribboned fibers increases with decreasing temperatures, the 
maximum loss occurring at the lowest temperature. As is evident in 
Fig. 1, the loss also increases with increasing cycle number. We 
subsequently show that this “pumping effect” is due to polyester-tape 
shrinkback† and stress relaxation effects during the high-temperature
exposure between cycles. 

It follows that for a 40-year design life, the worst-case environmental 
cycle would be a 40-year exposure at 190°F followed by a low-temper- 
ature exposure. This worst-case cycle is summarized in Table II. Note 


* Fibers used in this study had cladding and core diameters of 110 and 55
μm, respectively.
† Shrinkback is the recovery of process-induced strains.
[Figure 1: added loss in decibels per kilometer versus temperature in degrees Fahrenheit, plotted by cycle, with uncertainty bars.]

Fig. 1—Environmental added loss at 0.63 μm vs. temperature (UA-coated fiber, ASR
construction).


Table II—Worst-case environmental cycle

Exposure           Exposure
Temperature, T     Time, t
(°F)               (days)

   190             14,610 (40 years)
    75                  2
   -45                  2
   -15                  2
    15                  2
    75                  2


that the cycle of Table II is not suggested for testing purposes but will 
only be used to predict the most pessimistic estimates of performance. 


III. DUALITY BETWEEN MICROBENDING SENSITIVITY AND STRAIN


A convenient means for measuring the microbending sensitivity of 
different fibers is to determine the wavelength-independent loss coef- 
ficient® of the fiber when wound in several layers on a 6-inch-diameter 
reel under a tension fixed for all fibers. Let δ denote the wavelength-
independent loss coefficient measured under these conditions. This
method was contrived to produce artificially high losses, thereby 
magnifying the loss contribution due to microbending. Of course, the 
losses for these same fibers will be substantially smaller when measured 
in a stress-free configuration. 

The added-loss response to a given cycle (see Fig. 1) is characteristic 
of all fibers of a single given δ (as defined above) in a given structure
[Figure 2: schematic of added loss versus fiber compressive strain. Labeled points: A, Cycle I, 75°F; B, Cycle I, -45°F; C, Cycle II, 75°F; D, Cycle II, -45°F.]

Fig. 2—Environmental added loss vs. fiber compressive strain.


but will change for fibers having different δ's. Moreover, examination
of all the environmental data collected in this study indicates that the
change in loss at a given point in any cycle increases with increasing
δ.

As is well known, microbending loss occurs when the axis of a fiber 
is bent into a curve whose spectrum contains a certain critical fre- 
quency.’ Transverse pressure against a fiber on a microscopically 
rough surface has been shown to introduce loss in this way.” However, 
in the case of environmentally induced loss in an ASR ribbon, it is 
shown in the Appendix that the transverse strain due to thermal 
contraction reduces the contact forces between fibers. Thus, the added- 
loss response of Fig. 1 must be due to another mechanism. The only 
reasonable possibility is that the compressive axial strain induces 
microbending loss through fiber buckling. This loss would occur if the 
fiber buckling were at the critical wavelength or if buckling increases 
the contact forces between the fiber and its surroundings. It is not our 
intent to identify which of these means dominates the behavior of the 
fiber but only to characterize its manifestations. In other words, we 
simply correlate the added loss with the compressive strain imparted 
to the fiber. 

Both the added loss and compressive strain increase with decreasing 
temperature; their relationship* is shown schematically in Fig. 2. Here, 
the losses at 75°F and —45°F in Cycle I are denoted by A and B, 
respectively. After high-temperature exposure, the shrinkback of the 
polyester tape increases the strain on the fiber, so that C and D 


* If strain and added loss monotonically increase with decreasing temperature, then 
it is easy to show that the added loss monotonically increases with increasing strain. 
represent 75°F and -45°F in the second cycle. This explains the
increased loss with each cycle that is evidenced in Fig. 1. This phenom- 
enon has been observed consistently in all environmental testing and, 
as previously noted, is called the pumping effect. 

Since the loss increases with increasing δ, the added-loss response to
strain becomes steeper with increasing δ, as illustrated in Fig. 2. The
effect of δ in this graph is evidently a contraction of the strain scale
with increasing δ. Whether this contraction is uniform is readily
tested by plotting the loss versus logarithmic strain, where a
uniform contraction would then appear as a horizontal, rigid
translation of the added-loss curves with δ. As will be demonstrated in
Section V, the loss-strain data may indeed be reduced in this fashion.
Since the strain-scale contraction factor is a function of δ alone (the
contraction is uniform), it creates a one-to-one correspondence (duality)
between δ and compressive strain, changes in δ being equivalent to
changes in strain. With this in mind, we proceed to a calculation of the
axial strain in the environmental cycle. 
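
In symbols, the reduction described above can be stated as follows. This is a sketch only: L denotes added loss, ε the compressive strain, and a_ε(δ) the strain-scale contraction factor, while f and g are illustrative names for the master function that do not appear in the original.

```latex
L(\epsilon, \delta) = f\big(a_\epsilon(\delta)\,\epsilon\big)
\quad\Longrightarrow\quad
L = g\big(\log \epsilon + \log a_\epsilon(\delta)\big),
\qquad g(x) \equiv f(10^{x}).
```

Plotted against log ε, the curves for different δ are therefore copies of a single curve displaced horizontally by log a_ε(δ), which is the rigid translation referred to in the text.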


IV. THERMOMECHANICAL ANALYSIS 


This section is devoted to the summary of the formulas needed to 
calculate the axial strain history to which a fiber-optic ribbon is 
subjected during any environmental cycle. All of the plastics used in 
an ASR structure possess time- and temperature-dependent moduli, 
two examples of which are shown in Fig. 3. If the plastic is instanta- 
neously strained, the stress required to sustain that strain relaxes with 
time according to the given curve. Notice that the relaxation that 
occurs in the UA modulus at 140°F in one hour takes more than 40 
years at room temperature. The relaxation at —45°F is slower yet, so 
that the UA coating is as stiff as the polyester tape over significant 
time periods at these low temperatures. 

On the other hand, the coefficients of thermal expansion of these 
plastics are constant with time provided the temperature changes do 
not encompass their glass transition temperatures, T_g. Even so, by
virtue of the time dependence of the plastic moduli, the ribbon itself
exhibits time- and temperature-dependent thermal expansion and
contraction during the thermal cycle. This situation is depicted in Fig.
4. Since the expansion coefficients α_P of the plastics are much greater
than that (α_G) for glass, the fibers restrain the contraction of the
plastic, the net contraction αΔT of the ribbon being determined by
equilibrium (force balance) in the structure. The relatively high short- 
time modulus of the plastic (see Fig. 3) causes the initial contraction 
(expansion) to be high. As the modulus of the plastic relaxes, the 
energy stored in the glass causes the ribbon to recover some of this 
high initial strain. Of course, the rate of this recovery depends on the 
temperature at which it occurs. 
Fig. 3—Time dependence of moduli for ribbon constituents at various temperatures
(generated by time-temperature superposition from data collected by R. P. DeFabritis). 


The glass also resists shrinkback ε_S^TA of the polyester tape during
exposure to elevated temperatures. If it were unrestrained, the tape
would shrink back according to the results in Fig. 5. As illustrated in
Fig. 6, the shrinkback ε_S that the ribbon experiences is much less than
that of the free tape. 

The compressive strain induced in the ribbon by its thermal con- 
traction upon an excursion to —45°F is thus increased in each subse- 
quent cycle by the shrinkback during the high-temperature exposure. 
We now proceed to a calculation of these strains. 


4.1 Axial thermal expansion 


If the plastic phases were considered to be elastic, the axial tensile 
modulus of a fiber-optic ribbon would be approximated well by the 
rule of mixtures 


$$ E = \frac{\sum_i E_i A_i}{\sum_i A_i}, \qquad (1) $$





where E_i is the modulus of the ith constituent, A_i its area, and the
summation is taken over all phases. In reality, (1) is a lower bound for 
the effective elastic modulus, being exact if all the constituents have 
the same Poisson’s ratio.’ Furthermore, eq. (1) can be derived in an 
elementary fashion by considering the equilibrium of the ribbon if one 
ignores Poisson’s effect and assumes that all constituents are equally 
strained. 
[Figure 4: ribbon schematic (plastic, P, with modulus E_P(T, t); glass, G) and ribbon strain versus time for a temperature change from 75°F to -45°F.]

Fig. 4—Mechanics of ribbon thermal contraction.

[Figure 5: shrinkback strain in percent versus time at T_R = 140°F; data collected by C. J. Aloisio.]

Fig. 5—Shrinkback strain vs. time for 3M No. 5 polyester tape.

Fig. 6—Mechanics of ribbon shrinkback.


This same kind of elementary argument leads to the formula 


$$ \alpha = \frac{\sum_i \alpha_i E_i A_i}{\sum_i E_i A_i} \qquad (2) $$


for the coefficient of thermal expansion of the ribbon, given elastic
moduli E_i, areas A_i, and coefficients of thermal expansion
α_i. Schapery has shown that (2) is exact if each phase has the same
Poisson’s ratio and is a very good approximation in any case.’ These
elastic results may be used to generate their viscoelastic counterparts.
To do so, assume that all constituents obey the constitutive equations


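$$ \sigma(t) = \int_0^t E\big(T(t),\, t - \tau\big)\, \dot{\epsilon}_\sigma(\tau)\, d\tau, $$

$$ E(T, t) = a_E(T)\, E(T_R, \xi), \qquad \xi = \int_0^t \frac{d\tau}{a_T[T(\tau)]}, \qquad (3) $$

and

$$ \epsilon_\sigma(t) = \epsilon(t) - \alpha\,[T(t) - T_0], \qquad T_0 = T(0), \qquad (4) $$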


where ε, σ, and E(T, ·) are the strain, the stress, and the relaxation
modulus at temperature T, respectively. Here, a superposed dot on a
function denotes differentiation with respect to its argument. On a
log-log plot of modulus versus time, log a_E(T) and log a_T(T) represent,
respectively, the vertical and horizontal distances that the curve for a
temperature T must be translated to lie over that for the reference
temperature T_R. All the materials considered here, including glass as
a trivial case, conform to this hypothesis. 

When (3) and (4) are met, it is readily shown that equation (1) 
continues to hold in the viscoelastic case, so that 


$$ E(T, t) = \frac{\sum_i E_i(T, t)\, A_i}{\sum_i A_i} \qquad (5) $$


is exact (Poisson’s effect being ignored). Moreover, as long as the curve
of the logarithm of the modulus versus logarithmic time has small
curvature, a good approximation to α(T, t) is provided by the so-called
quasi-elastic approximation*®


$$ \alpha(T, t) = \frac{\sum_i \alpha_i\, E_i(T, t)\, A_i}{\sum_i E_i(T, t)\, A_i}. \qquad (6) $$

The strain in the ribbon due to a temperature history T in the absence 
of stress is characterized by 


$$ \epsilon(t) = \int_0^t \alpha\big(T(t),\, t - \tau\big)\, \dot{T}(\tau)\, d\tau. \qquad (7) $$


Notice that the ribbon has a time-dependent expansion coefficient 
even though we have supposed that the constituent plastics do not. 
This latter assumption is valid as long as the temperature excursion 
does not encompass the glass transition temperature T_g of any
constituent. Equation (6) can easily be shown to continue to hold even if
the constituents have time-dependent expansion coefficients. One need
only replace each α_i by α_i(T, t). For the purposes of this investigation,
it is sufficient to assume each α_i to be constant.

The data needed to calculate the effective modulus E and coefficient
α of thermal expansion of the ribbon at -45°F and 75°F for a time of
10 hours are given in Table III. The geometric data of Table III were
calculated using the nominal dimensions shown on the cross-sectional
view of a typical ribbon in Fig. 7. Notice that although the glass makes
far and away the most significant contribution to the stiffness of the 
Table III—Geometric and material properties of ribbon constituents
(ten-hour modulus)

                             A             α              E (10^5 psi)     EA (10^2 lb)     αEA (10^-3 lb/°F)
Constituent                  (10^-4 in^2)  (10^-5 °F^-1)  -45°F   75°F     -45°F   75°F     -45°F   75°F

3M No. 5 Polyester Tapes      3.00          1.06           5.27    4.67     1.58    1.40     1.67    1.48
3M No. 5 Acrylic Adhesive     3.60          7.72           0.00    0.00     0.00    0.00     0.00    0.00
Urethane-Acrylate Coatings    5.87          3.33           4.22    0.74     2.48    0.43     8.26    1.43
Glass Fibers                  1.77          0.028          107     107      18.94   18.94    0.53    0.53
Total                        14.24                                          23.00   20.77    10.46   3.44

Thus, eq. (6) gives α = ΣαEA / ΣEA = 4.55 × 10^-6 °F^-1 at -45°F and 1.66 × 10^-6 °F^-1 at 75°F.
[Figure 7: cross-sectional view of the ribbon, with the urethane-acrylate coating and the acrylic adhesive labeled.]

Fig. 7—Cross section of a twelve-fiber ribbon.


ribbon, other constituents play a major role in the calculation of α.
Indeed, at -45°F the αEA of the glass may well be neglected in
comparison with that of the coating. At 75°F, on the other hand, the 
tape, coating and glass make comparable contributions to the effective 
thermal expansion coefficient. 
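
The arithmetic behind Table III can be reproduced with a short script. The sketch below applies eqs. (5) and (6) to the ten-hour values of the table; the units assigned to each column (10^-4 in^2, 10^-5 °F^-1, 10^5 psi) are inferred from the table's own arithmetic, and the names used are illustrative.

```python
# Ten-hour values from Table III:
# (A [1e-4 in^2], alpha [1e-5 1/degF], E at -45 degF [1e5 psi], E at 75 degF [1e5 psi])
CONSTITUENTS = {
    "polyester tape":   (3.00, 1.06, 5.27, 4.67),
    "acrylic adhesive": (3.60, 7.72, 0.00, 0.00),
    "UA coating":       (5.87, 3.33, 4.22, 0.74),
    "glass fibers":     (1.77, 0.028, 107.0, 107.0),
}

def effective_properties(modulus_column):
    """Return (E in psi, alpha in 1/degF) of the ribbon from eqs. (5) and (6).

    modulus_column selects the temperature: 2 for -45 degF, 3 for 75 degF.
    """
    rows = CONSTITUENTS.values()
    sum_a = sum(r[0] for r in rows)
    sum_ea = sum(r[modulus_column] * r[0] for r in rows)
    sum_aea = sum(r[1] * r[modulus_column] * r[0] for r in rows)
    e_eff = sum_ea / sum_a * 1e5          # eq. (5), converted to psi
    alpha_eff = sum_aea / sum_ea * 1e-5   # eq. (6), converted to 1/degF
    return e_eff, alpha_eff

e_cold, alpha_cold = effective_properties(2)
e_room, alpha_room = effective_properties(3)
```

The results agree with the 4.55 × 10^-6 and 1.66 × 10^-6 °F^-1 quoted in the table to within the rounding of the tabulated intermediate products, and they show the expansion coefficient changing far more between the two temperatures than the modulus does.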

In Fig. 8, α and E are plotted logarithmically against the logarithm
of time at 75°F. It is evident that the change in α with respect to
temperature and time is much more significant than the change in 
ribbon modulus. The coefficients of thermal expansion listed for each 
of the constituents in Table III are for temperatures below their 
respective glass transition temperatures, since our interests are in 
[Figure 8: log E in psi and log α in °F^-1 versus log time in hours.]

Fig. 8—Time dependence of the modulus and the coefficient of thermal expansion for
the ribbon at 75°F (UA-coated fibers, ASR construction).


calculating the compressive strains imparted to the fiber during the 
low-temperature excursion. 
If the ribbon is exposed to the temperature history 


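$$ T(t) = T_0 + \sum_{k=1}^{n} (T_k - T_{k-1})\, H(t - t_k), \qquad H(t) = \begin{cases} 0, & t < 0 \\ 1, & t \ge 0, \end{cases} \qquad (8) $$

where 0 < t_1 < t_2 < ... < t_n < t, the resulting strain history is

$$ \epsilon(t) = \sum_{k=1}^{n} (T_k - T_{k-1})\, \alpha(T_n,\, t - t_k), \qquad (9) $$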


as found from substituting (8) into (7). Observe that eq. (9) contains 
all relaxation effects associated with the temperature history in (8). 
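
Equation (9) translates directly into a short routine. The following is a minimal sketch, assuming the caller supplies the ribbon expansion function α(T, t) of eq. (6) as a Python callable; the function and argument names are hypothetical.

```python
def strain_history(t, temp_0, steps, alpha):
    """Ribbon thermal strain at time t for a stepwise temperature history, eq. (9).

    temp_0 : initial temperature T_0
    steps  : list of (t_k, T_k) pairs sorted by time, each a step to temperature T_k
    alpha  : callable alpha(T, elapsed_time), the ribbon expansion coefficient
    """
    applied = [(t_k, temp_k) for t_k, temp_k in steps if t_k <= t]
    if not applied:
        return 0.0
    current_temp = applied[-1][1]  # T_n, the temperature in force at time t
    strain, previous_temp = 0.0, temp_0
    for t_k, temp_k in applied:
        strain += (temp_k - previous_temp) * alpha(current_temp, t - t_k)
        previous_temp = temp_k
    return strain
```

With a time-independent α the sum collapses to α times the net temperature change, so a return to the starting temperature leaves no strain; any residual strain after a full cycle therefore comes from the time dependence of α.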


4.2 Axial shrinkback 


We now set out to calculate the shrinkback induced in the ribbon by 
the shrinkage of the polyester tape during high-temperature exposure. 
Denote by ε_S^TA(T_R, t) the shrinkback (see Fig. 5) the tape would
experience in the time period [0, t] at some reference temperature* T_R
if it were unrestrained. At any other temperature T, the unrestrained
shrinkback in the same time period would be

$$ \epsilon_S^{TA}(T, t) = \epsilon_S^{TA}\big(T_R,\; t / a_T(T)\big), \qquad (10) $$

where a_T is the temperature-dependent scale-contraction factor. If the
tape is subjected to temperature T_1 for a time t_1 and then to T_2 for t_2,
the total shrinkback would be

$$ \epsilon_S^{TA}\big(T_R,\; t_1 / a_T(T_1) + t_2 / a_T(T_2)\big). $$

* T_R = 60°C in Fig. 5.


For the polyester tape used in the ASR construction, shrinkback is 
observable in the environmental cycle only during the high-tempera- 
ture exposures, the time scales being much too long at the lower 
temperatures. 
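
The bookkeeping for successive exposures amounts to accumulating reduced time at the reference temperature. The sketch below assumes the shift factor a_T(T) and the reference shrinkback curve (Fig. 5) are supplied as callables built from measured data; both names, and the functions themselves, are illustrative.

```python
def reduced_time(exposures, a_T):
    """Equivalent time at the reference temperature for a list of exposures.

    exposures : list of (temperature, duration) pairs
    a_T       : callable giving the time-scale shift factor at a temperature
    """
    return sum(duration / a_T(temperature) for temperature, duration in exposures)

def free_tape_shrinkback(exposures, a_T, shrinkback_at_reference):
    """Unrestrained tape shrinkback after the exposures, following eq. (10)."""
    return shrinkback_at_reference(reduced_time(exposures, a_T))
```

Because a_T is very large at low temperature, low-temperature exposures add almost nothing to the reduced time, which is why shrinkback is observable only during the high-temperature parts of the cycle.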

An analysis similar to that outlined in Section 4.1 for the time- 
dependent modulus and coefficient of thermal expansion of the ribbon 
can be used to obtain the approximate formula for the shrinkback 
strain of the ribbon ε_S:

$$ \epsilon_S(t) = S(T, t)\, \epsilon_S^{TA}(t), \qquad (11) $$

where

$$ S(T, t) = \frac{E_{TA}(T, t)\, A_{TA}}{\sum_i E_i(T, t)\, A_i}. \qquad (12) $$


A double logarithmic plot of the ribbon shrinkback function S as 
calculated from (12) is shown in Fig. 9 for T = 140°F. Observe that the 
shape of log S(T, ·) is very much like that of the logarithm of the
polyester modulus in Fig. 3. The ribbon shrinkback during any high- 
temperature exposure can be calculated by substituting the appropri- 
ate mechanical and shrinkback data (e.g., Figs. 3 and 5) into eqs. (11) 
and (12). The error made in using eq. (11) at 140°F has been shown to 
decrease monotonically from 10 percent at two hours to 8.8 percent at 
ten hours. The approximation of eq. (11) increasingly improves as time 
goes on and as temperature increases. Since our interest is in times 
greater than forty-eight hours at temperatures over 170°F, (11) is quite 
acceptable. 


4.3 Calculation of fiber compressive strains in the environmental test
cycle 


As discussed in Section III, a compressive strain applied to a
ribboned fiber induces added optical loss. This strain ε_F for the
temperature history T [see eq. (8)] is given by

$$ \epsilon_F(t) = \epsilon_R(t) - \alpha_F\,[T(t) - T_0], \qquad (13) $$

where α_F is the linear coefficient of thermal expansion of the glass fiber
and

$$ \epsilon_R(t) = \epsilon(t) + \epsilon_S(t) \qquad (14) $$

is the ribbon strain. In (14), ε is the strain due to the thermal
contraction of the ribbon [eq. (9)], and ε_S is the ribbon shrinkback
strain [eq. (11)]. 
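
Taken together, eqs. (11) through (14) reduce to a few arithmetic steps at any instant. The sketch below is illustrative only: the function names are hypothetical, the sign convention is simply that of the equations as written, and the inputs must be computed separately from eqs. (9) and (10).

```python
def shrinkback_factor(ea_tape, ea_all):
    """Eq. (12): S = E_TA * A_TA / sum_i(E_i * A_i), all EA products in the same units."""
    return ea_tape / sum(ea_all)

def fiber_strain(eps_thermal, eps_tape_free, s_factor, alpha_fiber, temp, temp_0):
    """Fiber strain at one instant from eqs. (11), (13), and (14)."""
    eps_shrink = s_factor * eps_tape_free               # eq. (11)
    eps_ribbon = eps_thermal + eps_shrink               # eq. (14)
    return eps_ribbon - alpha_fiber * (temp - temp_0)   # eq. (13)
```

As a rough illustration, the ten-hour EA products of Table III at 75°F give `shrinkback_factor(1.40, [1.40, 0.00, 0.43, 18.94])` of about 0.067, so under those values the ribbon would see only a small fraction of the free tape shrinkback. The curve in Fig. 9 is for 140°F, where the moduli differ.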

Thermoviscoelastic data on each of the constituent materials were 
incorporated into computer-programmed versions of eqs. (6), (9), (11), 
[Figure 9: log S versus log time in hours.]

Fig. 9—Time dependence of ribbon-shrinkback function at 140°F (UA-coated fibers,
ASR construction).


(12), and (13) to obtain the fiber strains on each excursion to —45°F. 
The results are plotted in Fig. 10. 


V. THE MASTER CURVE FOR ADDED LOSS VERSUS COMPRESSIVE 
STRAIN 

The purpose of this section is to construct the curve of added loss 
versus compressive strain (master curve) for a fiber having an arbitrary 
value of the microbending sensitivity parameter δ. This will be done
by demonstrating that the effect of δ on the loss-strain curve is to
uniformly contract the strain scale, which appears as a rigid shift
(translation) of the curves with δ when the loss data are plotted against
logarithmic strain. This master curve together with the δ-shift curve
can then be used to predict the performance of any fiber (δ known) in
any environmental cycle. 

In Fig. 11, the environmental added-loss data for five fibers are 
plotted versus the logarithm of the axial compressive strain on the 
glass fiber as calculated in Section IV (Fig. 10). These data include all 
the measurements for Cycles I through III according to Table I. 
Neither of the points for 75°F in Cycle I appears in these figures 
because for the first the strain is zero and for the second it is tensile 
(negative), so that both logarithms are undefined. 
Fig. 10—Calculated fiber compressive strain vs. temperature in the environmental
cycle (UA-coated fiber, ASR construction). 


5.1 Shifting with δ to form the master curve


The plots in Fig. 11 were laid over one another and shifted horizon- 
tally by hand until they formed a single, smooth curve. The amount of 
shift log a_ε(δ) required for each of the five values of δ is shown in Fig.
12 referenced to δ = 1.0 dB/km. The linear, least-squares fit shown in
Fig. 12 results in an excellent approximation of the shift data. For 
simplicity this linear approximation is used in the subsequent construc- 
tion of the master curve. 

The resulting master curve is shown in Fig. 13. The data points for 
the five different fibers are shown with different symbols on this plot. 
This master curve together with the δ-shift curve of Fig. 12 can be
used to obtain the added-loss-versus-strain profile for a fiber of any
given δ. A fourth-order polynomial has been fitted to the master curve
data and is included in Fig. 13. Observe that the scatter about the best-
fit polynomial is within the loss measurement uncertainty. Notice also
that there is significant overlap in the data from fiber to fiber. On
account of the functional relationship between a_ε and δ (illustrated in
Fig. 12), we may view the master curve of Fig. 13 as added loss versus
strain for a fixed value of δ or added loss versus δ for a fixed level of
strain. This makes precise the δ-strain duality alluded to in Section
III.
Fig. 11—Environmental added loss at 0.63 μm vs. calculated fiber strain in the
environmental test (UA-coated fibers, ASR construction).

Fig. 12—Strain-scale-contraction factor vs. microbending sensitivity parameter δ (ref-

Fig. 13—Environmental added loss vs. δ-reduced fiber strain (referenced to δ = 1.0
dB/km, UA-coated fibers, ASR construction).
Although the master curve of Fig. 13 has been constructed from data
collected in all three cycles, this was not necessary. Indeed, a polyno- 
mial fit through Cycle I is virtually indistinguishable from the one 
shown in Fig. 13. Thus, we are led to conclude that only one low- 
temperature excursion is necessary to complete an analysis of the type 
presented here, provided fibers having a suitable range of δ's are tested.
This observation greatly simplifies the environmental-testing proce- 
dure for fiber-optic ribbons. 


5.2 Prediction of loss in the worst-case cycle 


We are now in a position to predict the behavior of a UA ASR in
the worst-case environmental cycle. To do this we need only add to
the logarithm of the strain associated with each temperature and time
(see Fig. 10) the logarithmic shift associated with the appropriate value
of δ (see Fig. 12) for the fiber in question. The loss can then be read off
the master curve (Fig. 13). For example, the strain ε_F in the worst-case
cycle at -45°F obtained from Fig. 10 is 0.105%. The shift corresponding
to a δ of 2.32 dB/km from the equation in Fig. 12 is log a_ε(2.32) = 0.51.
Thus, the log of the δ-reduced strain, log[ε_F a_ε(2.32)] = -0.47, yields an
added loss of 9.8 dB/km from Fig. 13. This procedure was employed to
generate the curves in Fig. 14 for the ribbon behavior in the worst-case
cycle for different values of δ.
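
The numbers in this example can be checked with a short calculation. The sketch below assumes the linear shift relation log a_ε(δ) = 0.384(δ - 1) quoted for Fig. 12 in Section 6.1, with strain expressed in percent; the function names are illustrative, and the final lookup on the master curve is not coded because the fitted polynomial is not given in the text.

```python
import math

SHIFT_SLOPE = 0.384  # slope of the linear fit to the shift data, per dB/km
DELTA_REF = 1.0      # reference sensitivity in dB/km

def log_shift(delta):
    """Logarithmic strain-scale shift for a fiber of sensitivity delta (dB/km)."""
    return SHIFT_SLOPE * (delta - DELTA_REF)

def log_reduced_strain(strain_percent, delta):
    """Log of the delta-reduced strain used to enter the master curve."""
    return math.log10(strain_percent) + log_shift(delta)

shift = log_shift(2.32)                     # rounds to 0.51
reduced = log_reduced_strain(0.105, 2.32)   # rounds to -0.47
```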

Notice for a δ of 2.32 dB/km in Fig. 14 that the loss in the worst-
case cycle is much worse than that of Cycle III shown in Fig. 1. This,
of course, is the shrinkback effect previously discussed. Also, observe
from comparing Figs. 11 and 14 that a fiber with δ = 0.54 dB/km would
meet a 0.5 dB/km added-loss criterion after Cycle III but not after the
worst-case cycle. It is therefore clear that data sufficient to allow long-
term predictions are essential to any meaningful ribbon evaluation.
Finally, we remark that the added loss for a UA ASR can be predicted
in the way demonstrated above for any environmental cycle provided
one calculates the appropriate strain history by means of the formulas
in Section IV.

Fig. 14—Environmental added loss at 0.63 μm vs. temperature in the worst-case cycle
(UA-coated fibers, ASR construction).


VI. APPLICATIONS 
6.1 A Criterion for coating/ribbon-structure comparisons 


If one were to choose a ribbon-performance requirement of 1.0 dB/ 
km in the worst-case cycle, then it is clear from Fig. 14 that an ASR 
made with a population of UA-coated fibers having high δ's would be
unacceptable. The same structure manufactured with low-δ fibers
would satisfy the requirement. Thus, one can easily err out of ignorance
of the strong effect arising from the sensitivity of the fiber to
microbending (as reflected in δ). Obviously, for a given performance
requirement there is a maximum permissible δ-value, β. That is, if ribbons
are manufactured from a fiber selection having δ's less than β, the given
performance requirement will be met. Clearly then, it is not a question
of whether the UA-ASR structure is acceptable or not, but rather how
restrictive the value of β is. The parameter β is therefore an effective
criterion for comparing various coating and ribbon-structure
combinations.

To calculate this maximum permissible δ-value, β, for the UA ASR
given a 1.0 dB/km added loss in the worst-case cycle, we enter into the
master curve (Fig. 13) at a loss of 1.0 dB/km and read off the associated
reduced strain

$$ \log[\epsilon_F\, a_\epsilon(\beta)] = -1.09. \qquad (15) $$

Next, introduce the equation for the δ-shift curve (Fig. 12),

$$ \log a_\epsilon(\beta) = 0.384\,(\beta - 1), $$

into (15) to arrive at

$$ \beta = 1 - \frac{1.09 + \log \epsilon_F}{0.384}. \qquad (16) $$

Now the -45°F strain in the worst-case cycle is 0.105 percent, so that
(16) yields

β = 0.7 dB/km.

Similarly, for a performance requirement of 0.5 dB/km, one finds

β = 0.2 dB/km.


Thus, UA-coated fibers selected with δ's less than β could be used in
an ASR and satisfy the chosen performance requirement for a 40-year 
design life. 
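
The calculation of the maximum permissible δ can be written as a one-line function. This is a sketch under the same assumptions as the text: the linear shift relation with slope 0.384 referenced to δ = 1 dB/km, strain in percent, and a reduced-strain value read by hand from the master curve. The function name and signature are hypothetical.

```python
import math

def max_permissible_delta(log_reduced_strain_limit, strain_percent,
                          slope=0.384, delta_ref=1.0):
    """Solve log(strain) + slope * (beta - delta_ref) = log_reduced_strain_limit for beta."""
    return delta_ref + (log_reduced_strain_limit - math.log10(strain_percent)) / slope

beta = max_permissible_delta(-1.09, 0.105)  # rounds to 0.7 dB/km
```

Changing the predicted strain, for example by reducing tape shrinkback or coating area, changes β through the `strain_percent` argument alone, provided the master curve and shift relation remain valid for the modified ribbon.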


6.2 Critical material and geometric properties 


The maximum permissible δ, β, can also be used to evaluate how
material and geometric properties might be altered to improve
environmental performance. It is obvious that a reduction in polyester-
tape shrinkback would reduce environmental added loss. This is
illustrated dramatically in Fig. 15, where the parameter β is plotted versus
shrinkback normalized with respect to that of the present tape in the
worst-case cycle (see Fig. 5). The two curves are for performance
standards of 0.5 and 1.0 dB/km in the worst-case cycle. Notice that a
50-percent reduction increases β from 0.2 to 0.5 dB/km for the 0.5 dB/
km performance requirement, while elimination of the shrinkback
altogether increases β to 0.8 dB/km. Observe from eqs. (11) and (12)
that the ribbon shrinkback is linear with respect to the area of the
tape. Thus, Fig. 15 can also be viewed as a plot of β versus reduced
area normalized to that of the present tape (see Table III).

Fig. 15—Effect of shrinkback on the maximum permissible δ-value for the worst-case
cycle (UA-coated fibers, ASR construction).

A consideration of Table III reveals that reducing the αEA product
of the coating will likewise diminish the strain on the fibers, since the
coefficient of thermal expansion of the ribbon would then be reduced.
The coefficients α of thermal expansion are essentially the same for all
plastics though the moduli E can vary substantially from one material
to another. Notice further that the area A of the coating is greater
than that of any other ribbon constituent. Whether this reduction
in fiber strain is accompanied by an environmental performance
improvement depends on the extent to which the added-loss-versus-
compressive-strain master curve and its associated δ-shift
characteristic are influenced by the changes adopted to reduce the fiber strain.
Changing α of the coating has no effect on these curves while changing
coating modulus E and/or the ribbon geometry may have a substantial
effect. The variation in the shape of these curves with changes in these
various parameters can be ascertained only by completing a charac- 
terization of the type carried out here for the UA ASR. 

Figure 16 shows how strong the effect of reducing thermal contraction 
can be when the change in the added-loss-reduced-strain characteristic 
is negligible. Here we have plotted $\beta$ against the coated-fiber 
outer diameter, holding the fiber diameter constant. Observe that if 
the outer diameter of the coated fiber is reduced from 9 to 6 mils, the 
effect on $\beta$ is bigger than the effect of a 50-percent reduction in 
tape shrinkback. Letting the coating thickness tend to zero results in 
an increase in $\beta$ from 0.2 to 1.0 dB/km for the 0.5 dB/km 
performance requirement. If a 50-percent reduction in tape shrinkback 
is combined with a reduction in coated-fiber outer diameter to 
6.5 mils, one can show that $\beta$ increases from 0.2 to 0.9 dB/km. 
Finally, we remark that these same dramatic effects should occur with 
reductions in the coating modulus. 

Other investigators have established that dual-coated fibers having 
a soft primary coating are less sensitive to microbending than fibers 
having a hard, single coating.’ Such dual-coated fibers in an ASR 
should also behave well environmentally. To see this, consider a 
secondary coating of UA over silicone where the area of the UA skin 
is $1.56 \times 10^{-5}$ in.$^2$ Neglecting the stiffness of the silicone 
(it is three orders of magnitude smaller than that of UA), we obtained 
an effective single-coated UA diameter of 6.4 mils. The corresponding 
increase in $\beta$ (Fig. 16) is from 0.2 to 0.6 dB/km for the 0.5 dB/km 
performance criterion. As previously mentioned, these conclusions are 
predicated upon the assumption that the basic loss-strain 
characteristic of the ASR structure is relatively insensitive to 
changes in coating modulus. 

Fig. 16. Effect of coating thickness on the maximum permissible 
$\delta$-value for the worst-case cycle (UA-coated fibers, ASR 
construction, fiber diameter = 4.33 mils). 


VII. CONCLUSIONS AND RECOMMENDATIONS 


We have shown that: 

(i) Environmental added loss in an ASR is associated with axial 
compressive strain (not transverse strain) imparted to the fibers by 
thermal contraction of the ribbon in the low-temperature excursion 
and ribbon shrinkback together with relaxation during the 
high-temperature exposure. 

(ii) There is a duality between a measure $\delta$ of the microbending 
sensitivity of a fiber and the fiber-compressive strain; viz., the 
effect of $\delta$ on the characteristic added-loss response to 
compressive strain is a uniform contraction of the strain scale. 

(iii) Points (i) and (ii) above result in a master curve of 
environmental added loss versus $\delta$-reduced fiber strain together 
with a plot of the strain-scale contraction factor versus 
$\delta$-value. We are thus led to the algorithm depicted schematically 
in Fig. 17. Therefore, when suitable material, geometric, 
environmental, and optical data are synthesized by way of our theory, 
we obtain a master curve and a $\delta$-shift curve that are sufficient 
to predict the environmental added loss for the candidate ribbon 
configuration for any desired environmental cycle. Notice that this 
environmental-testing scheme eliminates the necessity for long-term 
testing, only one low-temperature excursion being required. 

(iv) There is a maximum value of $\delta$ that a particular ribbon 
design can accommodate and still meet a given performance requirement. 
This maximum permissible $\delta$-value is an excellent criterion for 
making coating/ribbon-structure comparisons. 

By application of these techniques, we have found: 

(v) The critical material and geometric parameters in ASR design 
are the shrinkback of the polyester tape and the product $\alpha EA$ of 
the coefficient $\alpha$ of thermal expansion, the time- and 
temperature-dependent relaxation modulus $E$, and the area $A$ 
associated with the fiber coating. 

(vi) Any method of reducing the $\alpha EA$ product of the coating, such 
as introducing a soft, single coating or a dual system with a soft 
primary coating, should improve environmental performance. 
APPENDIX 
Transverse Thermal Contraction of an Adhesive-Sandwich Ribbon 


Consider a section of an ASR containing two coated fibers as shown 
in Fig. 18. If the temperature is instantaneously lowered by an amount 
$|\Delta T|$ ($\Delta T < 0$) from the temperature $T$, the section AB of 
tape would have an unstressed length $l_{TA}$ of 

$$l_{TA} = 2b(1 + \alpha_{TA}\,\Delta T), \tag{17}$$

where $b$ is the outer radius of the coated fiber and $\alpha_{TA}$ is the 
coefficient of transverse thermal expansion of the polyester tape. 

Fig. 18. Two-fiber portion of an ASR cross section prior to thermal 
contraction. 

A coated fiber would also shrink in diameter, the final radius $b_f$ 
conforming to 

$$b_f = b(1 + \alpha_f\,\Delta T), \tag{18}$$

where $\alpha_f$ is a number somewhere between the coefficients of thermal 
expansion of the glass and the plastic coating. To estimate $\alpha_f$ we 
appeal to linear thermoelasticity theory. 

Denote by $u$ and $\sigma$ the radial displacement and stress fields 
(assumed axisymmetric) in the coating after the temperature is lowered. 
The stress field $\sigma$ is to vanish at $r = b$, and it must match the 
stress in the glass fiber at $r = a$. The displacements must also match 
at this fiber-coating interface. Since the glass is many times stiffer 
than the plastic, we approximate the boundary conditions at $r = a$ by 
supposing the glass-coating interface to shrink with the glass 
coefficient of thermal expansion. Thus, 

$$u(a) = a\,\alpha_F\,\Delta T, \qquad \sigma(b) = 0, \tag{19}$$

where $\alpha_F$ is the coefficient of thermal expansion of the glass fiber. 

When the axisymmetric solution to the requisite field equations’ is 
substituted into (19), a pair of simultaneous, linear algebraic equations 
for two unknown constants results. Solving these equations and doing 
some algebra, one deduces that 

$$u(b) = \alpha_f\, b\,\Delta T, \tag{20}$$

where 

$$\alpha_f = \frac{4\left[(b^2/a^2 - 1)\,\alpha + \alpha_F\right]}{3b^2/a^2 + 1} \tag{21}$$

when Poisson’s ratio of the plastic is taken as 1/3. 

For the case at hand, we use the values for $a$ and $b$ from the table 
in Fig. 7 and select $\alpha$ and $\alpha_F$ for the urethane-acrylate 
coating and the glass, respectively, from Table III. Substitution of 
these numbers into (21) gives 

$$\alpha_f = 3.18 \times 10^{-5}\ {}^{\circ}\mathrm{F}^{-1}, \tag{22}$$

which is 95 percent of the urethane-acrylate value. 
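
As a rough numerical check of eqs. (21) and (24), the sketch below evaluates the effective expansion coefficient and the sign of the tape-to-fiber gap. The geometry uses the 4.33-mil fiber diameter and the 9-mil coated diameter mentioned in Section 6.2; the three expansion coefficients are placeholder values assumed here only for illustration, since Table III and the table in Fig. 7 are not reproduced in this section.

```python
def effective_expansion(a, b, alpha_coating, alpha_glass):
    """Eq. (21): effective radial expansion coefficient of a coated fiber.

    Assumes Poisson's ratio of the coating is 1/3, as in the text.
    a is the glass radius and b the coated-fiber outer radius (same units).
    """
    ratio_sq = (b / a) ** 2
    numerator = 4.0 * ((ratio_sq - 1.0) * alpha_coating + alpha_glass)
    return numerator / (3.0 * ratio_sq + 1.0)


def tape_fiber_gap(b, delta_t, alpha_tape_transverse, alpha_f):
    """Eq. (24): d > 0 means the fibers separate, d < 0 means they are pressed."""
    return 2.0 * b * delta_t * (alpha_tape_transverse - alpha_f)


# Geometry taken from the text: radii in mils.
a_mils = 4.33 / 2.0
b_mils = 9.0 / 2.0

# ASSUMED material values, per deg F (hypothetical; not taken from Table III).
ALPHA_UA = 3.33e-5
ALPHA_GLASS = 3.0e-7
ALPHA_TAPE_T = 1.0e-5

alpha_f = effective_expansion(a_mils, b_mils, ALPHA_UA, ALPHA_GLASS)
gap_cooling = tape_fiber_gap(b_mils, -100.0, ALPHA_TAPE_T, alpha_f)
gap_heating = tape_fiber_gap(b_mils, +100.0, ALPHA_TAPE_T, alpha_f)

print(f"alpha_f = {alpha_f:.3e} per deg F ({alpha_f / ALPHA_UA:.0%} of coating value)")
print(f"gap for a 100 deg F drop: {gap_cooling:.2e} mils")
print(f"gap for a 100 deg F rise: {gap_heating:.2e} mils")
```

With these assumed inputs the ratio of the effective coefficient to the coating coefficient comes out near the 95 percent quoted above, and the gap changes sign between cooling and heating.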
If the final length $l_{TA}$ of the tape segment is less than the final 
diameter $2b_f$ of the coated fiber, pressure will be created between the 
fibers and microbending loss may result. Otherwise, the fibers will 
separate. Thus, the critical quantity is 

$$d = l_{TA} - 2b_f. \tag{23}$$

Incorporating eqs. (17) and (18) into (23) results in 

$$d = 2b\,\Delta T\,(\alpha_{TA} - \alpha_f). \tag{24}$$

For the case at hand $\alpha_{TA} < \alpha_f$, since the transverse thermal 
expansion coefficient for the tape is even less than the axial value 
reported in Table III. We thus conclude from (24) that $d$ is positive 
(recall that $\Delta T < 0$). 

Thus, as the temperature is lowered the fibers separate, and 
microbending loss from transverse contraction is therefore impossible. 
As the temperature is raised, however, the fibers obviously approach 
one another since at high temperatures $\alpha_{TA}$ is still less than 
$\alpha_f$, but now $\Delta T > 0$. Moreover, there is transverse 
shrinkback at high temperatures. There are indeed environmental loss 
data that indicate increased loss at high temperatures. 
Adaptive Linearization of Power Amplifiers in 
Digital Radio Systems 


By A. A. M. SALEH and J. SALZ 
(Manuscript received October 5, 1982) 


High-frequency power amplifiers operate most efficiently at saturation, 
i.e., in the nonlinear range of their input/output characteristics. 
This phenomenon has traditionally dictated the use of constant 
envelope modulation methods for data transmission, resulting in 
circular signal constellations. This approach has inherently limited 
the admissible data rates in digital radio. In this paper we present a 
method for solving this problem without sacrificing amplifier power 
efficiency. We describe and analyze an adaptive linearizer that can 
automatically compensate for amplifier nonlinearity and thus make 
it possible to transmit multilevel quadrature amplitude modulated 
signals without incurring intolerable constellation distortions. The 
linearizer utilizes a real-time, data-directed, recursive algorithm for 
predistorting the signal constellation. Our analysis and computer 
simulations indicate that the algorithm is robust and converges 
rapidly from a blind start. Furthermore, the signal constellation and 
the average transmitted power can both be changed through software. 


I. INTRODUCTION 


Progress in high-speed data transmission over radio channels has 
lagged behind that of the voiceband channel. Inherent difficulties 
associated with implementing automatic equalizers are partially re- 
sponsible, but the application of multilevel quadrature amplitude 
modulation (QAM) also has been inhibited by the amplitude 
(AM/AM) and phase (AM/PM) nonlinearities present in radio-fre- 
quency (RF) power amplifiers. 

Recent work’ has evolved design principles showing a possibility of 
substantial improvements in QAM performance over linear fading 
radio channels. A crucial obstacle to achieving these gains is the 
nonlinear distortion introduced by power amplifiers. Attempts to 
realize high-speed data transmission over these channels force 
consideration of methods to cope with this nonlinear distortion. 

One approach is to back off from saturation sufficiently so that the 
signal level is restricted to the linear range of amplification. The 
required amount of power back-off can be several decibels, resulting in 
an inefficient operation of the power amplifier. Moreover, the achieve- 
ment of a given desired level of average transmitted power would 
require the use of a large, expensive, and high-power-consuming am- 
plifier. 

It has been realized?’ that some improvements in this regard can be 
obtained by using fixed signal predistortion circuits prior to amplifi- 
cation. Such circuits, however, cannot compensate for drifts in power 
amplifier nonlinearities caused by temperature changes, dc power 
variations, and component aging. These fluctuations can considerably 
degrade performance of systems employing constellations with large 
numbers of points, say 64 or higher, and the use of an adaptive 
technique is necessary in these applications. 

Conceptually, the nonlinear distortion introduced by the power 
amplifier can be minimized at the receiver by an adaptive nonlinear 
equalizer. Such schemes have been proposed and studied in voiceband 
data transmission® and for filtered PSK signals operating over satellite 
channels.’ This approach does not seem to be reasonable or necessary 
in our application, since the source of the nonlinearity is at the 
transmitter. Thus, it would appear logical to equalize the nonlinearity 
at the transmitter, where it occurs and where the transmitted bits are 
available. 

Thus, this paper focuses on the problem of adaptive predistortion 
linearization. We describe a transmitter-based recursive algorithm for 
predistorting the signal constellation, thereby rendering a virtually 
linear transmitter. The algorithm operates in real time and is data 
directed. The predistortion is accomplished within a digital memory, 
which is used to generate the desired baseband signal. This maximizes 
the use of digital technology, and increases the reliability and flexibility 
of the system. For example, it is possible to change the signal constel- 
lation and the average transmitted power through software. Our treat- 
ment applies only to single-valued, memoryless nonlinearities. 

The idea of adaptive predistortion of signal constellations has been 
previously suggested.’° In this reference, the predistortion of each 
point of the constellation is accomplished by switching the RF or 
intermediate-frequency (IF) input signal to a separate path containing 
an adjustable diode attenuator and an adjustable diode phase shifter. 
The analog hardware required for such an implementation would be 
quite involved, especially for a large number of points. We have also 
found in the patent literature’ a description of a digital adaptive 
predistorter that appears to be similar to the one we present here. 
However, no detailed description of the algorithm or its behavior is 
provided. 

Our basic ideas are described in Section II. Mathematical analysis 
is provided in Section III, and simulation results of the adaptive 
linearizer with finite-precision arithmetic are given in Section IV. 


II. GENERAL DESCRIPTION AND REQUIREMENTS OF OPERATION 


A block diagram of a QAM transmitter with the proposed predistor- 
tion linearizer is shown in Fig. 1. A random access memory (RAM) 
contains the predistorted values of the in-phase and quadrature volt- 
ages of each point on the QAM constellation. A memory-lookup 
encoder obtains each input data symbol and generates the RAM 
addresses of the desired signal point. The corresponding stored, pre- 
distorted voltage values are converted to analog voltages using a pair 
of digital-to-analog (D/A) converters. These voltages drive a quadra- 
ture modulator, which generates the desired predistorted RF signal for 
the duration of the input symbol. That signal is then amplified, filtered, 
and transmitted. This part of the linearizer is similar in operation to a 
recently proposed memory-based encoder described in Ref. 12. 

To update the RAM information, the amplifier output is sampled 
by a directional coupler and demodulated using the same local oscil- 
lator used for the modulator, which eliminates the need for carrier 
recovery. The output in-phase and quadrature voltages of the de- 
modulator are converted to digital form using a pair of analog-to- 
digital (A/D) converters. A linearizing processor, which is the heart of 
the linearizer, receives this information, compares it with the input 
data, and computes the resulting error. A recursive algorithm, which 
is discussed in the next section, uses the error to update the voltage 
values in the RAM corresponding to the particular data point under 
consideration. Note that each point on the signal constellation is 
treated separately. Thus, the linearizer can support any desired con- 
stellation. 

The memory-lookup encoder and the D/A converters have to op- 
erate, of course, at the full signaling rate. However, the linearizing 
processor and the A/D converters can operate at a much reduced rate, 
since the updating process is only needed to compensate for drifts that 
occur on a much slower time scale than the data rate. 

We now emphasize a crucial point. The proper operation of the 
linearizer as described above assumes that the amplifier is memoryless, 
and requires that the signal not be filtered before the power amplifier. 
Thus, all pulse shaping and filtering must be performed by the 
combination of the post-amplifier, RF bandpass filter, and the receiver 
filters. The former filter should be designed just to meet FCC (or 
other) emission rules for square input pulses, as described in Ref. 13. 
Such an implementation may require an automatic equalizer at the 
receiver to eliminate residual intersymbol interference. It was shown’ 
that, even with ideal filtering, adaptive equalization would still be 
necessary to compensate for multipath fading using QAM signals with 
large numbers of levels. Thus, the elimination of pre-amplifier pulse 
shaping appears to be a mild constraint. 

Fig. 1. Schematic representation of the adaptive, digital, predistortion 
linearizer. 

Because of the more stringent requirement on the post-amplifier, 
RF bandpass filter, its loss would of course be increased over that of 
the conventional case where pre-amplifier filtering does the spectrum 
shaping. However, computations” based on practical filters’ operating 
in the 6-GHz band show that the RF filter loss would only increase 
from about 0.5 dB to 1.5 dB. Moreover, we will see in Section IV that 
the linearizer in our system would allow the operation of the power 
amplifier to approach saturation. This would result in several decibels 
of power increase, which would more than compensate for the 
one-decibel increase in the filter loss. 


III. THE RECURSIVE ALGORITHM 


As we already mentioned, our approach is applicable to any signal 
constellation. However, for clarity of exposition, we restrict our anal- 
ysis to a rectangular constellation. Referring to Figs. 1 and 2, we denote 
a point on the rectangular QAM constellation by the complex number 


$$a + ib = \rho e^{i\phi},$$

where, for an $L^2$-level system, $a$ and $b$ assume values on the lattice 
$\pm 1, \pm 3, \pm 5, \ldots, \pm(L - 1)$. We denote the predistorted point 
by the complex number 

$$x + iy = r e^{i\theta}.$$

Fig. 2. The signal constellations at various points in the linearizer 
of Fig. 1. (a) Quadrature amplitude modulation data, 
$a + ib = \rho e^{i\phi}$. (b) Digital/analog data, $x + iy = r e^{i\theta}$. 
(c) Analog/digital data, $u + iv = R e^{i\psi}$. 

A sequence of these points amplitude and phase modulates a carrier at 
frequency $f_c$, and the resulting signal, 

$$S(t) = \sum_m r^{(m)}\, p(t - mT)\, e^{i(2\pi f_c t + \theta^{(m)})}, \tag{1}$$

is applied to the amplifier. In (1), $p(t)$ is a rectangular pulse, $1/T$ 
is the signaling rate, and $r^{(m)}$, $\theta^{(m)}$ represent the amplitude 
and phase of the $m$th data symbol. 

The amplifier is customarily represented by a pair of memoryless 
nonlinear functions.’*’” The amplitude function, $A(r)$, causes 
amplitude distortion (AM/AM) and the phase function, $\Phi(r)$, causes 
amplitude to phase conversion (AM/PM). Thus the signal, (1), after 
amplification becomes 

$$S_0(t) = \sum_m A(r^{(m)})\, p(t - mT)\, e^{i[2\pi f_c t + \theta^{(m)} + \Phi(r^{(m)})]}. \tag{2}$$

Figure 3 shows sketches of typical curves $A(r)$ and $\Phi(r)$ for a 
traveling-wave tube (TWT) power amplifier. 

We remark that if the functions $A(\cdot)$ and $\Phi(\cdot)$ were known 
exactly we could choose transformations $g(\cdot)$ and $h(\cdot)$ from 
points $(\rho, \phi)$ to $(r, \theta)$ so that 

$$A(r) = A[g(\rho)] = G\rho, \tag{3}$$

where 

$$r = g(\rho),$$

and 

$$\Phi(r) + h(\phi, \rho) = \phi + \zeta, \tag{4}$$

where 

$$\theta = h(\phi, \rho).$$

The constant gain, $G$, in (3) is some desired gain, taking into account 
the linear gain of the power amplifier and the coupling factor of the 
sampling directional coupler. The fixed phase shift, $\zeta$, in (4) is 
arbitrary and can be set to zero without loss of generality. 

Fig. 3. Typical example of the input/output amplitude and phase 
characteristics of a TWT amplifier. Axes: normalized input amplitude 
($r$) versus normalized output amplitude and output phase in degrees. 

Finally, let the measured data point from the A/D converter be 

$$u + iv = R e^{i\psi}.$$

The object of the predistortion is to find a solution 

$$r = \hat{r}$$

and 

$$\theta = \hat{\theta},$$

such that 

$$R e^{i\psi} = G\rho e^{i\phi}. \tag{5}$$

We now describe an iteration procedure that converges to (5). 

Since data are usually scrambled, a specific data point $(\rho, \phi)$ 
will occur at random. Let $(r_n, \theta_n)$ be the predistorted RAM data 
point and $(R_n, \psi_n)$ be the measured data point at the $n$th time the 
desired data point $(\rho, \phi)$ occurs. The measured output radius 
$R_n$ is then 

$$R_n = A(r_n) + \nu_n \tag{6a}$$

while the measured angle is 

$$\psi_n = \Phi(r_n) + \theta_n + \mu_n, \tag{6b}$$

where $\nu_n$ and $\mu_n$ are zero-mean measurement errors. If the 
measurements were perfect and noiseless, we would simply solve the 
following two nonlinear equations 

$$f(\hat{r}) = A(\hat{r}) - G\rho = 0 \tag{7}$$

and 

$$\Phi(\hat{r}) + \hat{\theta} - \phi = 0 \tag{8}$$

for $\hat{r}$ and $\hat{\theta}$. 

A great number of iterative procedures are known for solving the 
nonlinear equations given in (7) and (8). In the presence of 
measurement noise (and convex functions $A$ and $\Phi$), stochastic 
approximation algorithms” provide efficient methods. So, if one chooses 
step sizes $\alpha_n$ and $\beta_n$, which behave as $o(1/n)$, the 
following recursions are known to converge to the true solutions 
$\hat{r}$ and $\hat{\theta}$ in a mean-square sense: 

$$r_{n+1} = r_n - \alpha_n (R_n - G\rho) \tag{9a}$$

$$\theta_{n+1} = \theta_n - \beta_n (\psi_n - \phi). \tag{9b}$$

To provide a rationale for the above recursions we analyze, in some 
detail, the behavior of (9a). The behavior of (9b) is similar. Since 
$f(X)$, eq. (7), is continuous on $(X_0, X_{\max})$ and if, by hypothesis, 
the derivative $f'(X)$ of $f(X)$ exists on this interval, then there is a 
$\xi$ such that 

$$X_0 \le \xi \le X_{\max}, \qquad f'(\xi) = \frac{f(X_{\max}) - f(X_0)}{X_{\max} - X_0}. \tag{10}$$

Applying this mean-value theorem to $f(r_n)$, eq. (7), we get 

$$f(r_n) = A(r_n) - G\rho = f(\hat{r}) + f'(\xi_n)(r_n - \hat{r}), \tag{11}$$

where $\xi_n$ lies between $r_n$ and $\hat{r}$. Substituting this into (9a) 
we get 

$$r_{n+1} - \hat{r} = r_n - \hat{r} - \alpha_n\left[A'(\xi_n)(r_n - \hat{r}) + \nu_n\right]. \tag{12}$$

In (12) we made use of the fact that $f(\hat{r}) = 0$ and subtracted 
$\hat{r}$ from both sides of the equation. 
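Letting $\alpha_n A'(\xi_n) = \gamma_n$, $r_n - \hat{r} = \delta_n$ and then 
iterating (12) we obtain 

$$\delta_{n+1} = (1 - \gamma_n)\,\delta_n - \frac{\gamma_n \nu_n}{A'(\xi_n)} = \delta_1 \prod_{i=1}^{n} (1 - \gamma_i) - \sum_{i=1}^{n} \frac{\gamma_i \nu_i}{A'(\xi_i)} \prod_{j=i+1}^{n} (1 - \gamma_j). \tag{13}$$

We now compute the mathematical expectation of $\delta_{n+1}$ and the 
variance: 

$$\bar{\delta}_{n+1} = E\{\delta_{n+1}\} = E\{\delta_1\} \prod_{i=1}^{n} (1 - \gamma_i) \le E\{\delta_1\}\, e^{-\sum_{i=1}^{n} \gamma_i} \tag{14}$$

and 

$$\sigma_{n+1}^2 = E\{(\delta_{n+1} - \bar{\delta}_{n+1})^2\} = \sigma_\nu^2 \sum_{i=1}^{n} \frac{\gamma_i^2}{[A'(\xi_i)]^2} \prod_{j=i+1}^{n} (1 - \gamma_j)^2, \tag{15}$$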
where $\sigma_\nu^2 = E\{\nu_n^2\}$. We see from (14) and (15) that rapid 
convergence of the algorithm is critically dependent on the step-size 
sequence, $\gamma_n$, which in turn depends on the derivative of the 
nonlinearity in the neighborhood of the solution. While the derivative 
is unknown, in most applications it can be bounded away from zero and 
this makes it possible to estimate the best step-size sequence. We see 
from (14) that, even in the absence of noise, the algorithm is 
guaranteed to converge to the true solution only if the step-size 
series diverges, i.e., 

$$\sum_{i=1}^{n} \gamma_i = \sum_{i=1}^{n} \alpha_i A'(\xi_i) \to \infty, \qquad n \to \infty. \tag{16}$$

If $\alpha_i$ and $A'(\xi_i)$ are restricted to be positive, (16) is 
equivalent to requiring $\sum_{i=1}^{\infty} \alpha_i$ to diverge. So it 
follows that the structure of the sequence can be of the form 

$$\alpha_n = \frac{\alpha}{n^{\eta}}, \tag{17}$$


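where $\alpha$ is a positive constant and $0 \le \eta \le 1$. It can be 
shown that for this choice of $\alpha_n$, and for $c = \alpha A'(\hat{r})$, 

$$\frac{E\{\delta_{n+1}\}}{E\{\delta_1\}} \le \exp\left\{-c\left[\frac{n^{1-\eta}}{1-\eta} - q(\eta)\right]\right\} \qquad (0 < \eta < 1) \tag{18}$$

and 

$$\frac{\sigma_{n+1}^2}{\sigma_\nu^2 / [A'(\hat{r})]^2} \approx \frac{c}{2 n^{\eta}} \qquad (0 < \eta < 1), \tag{19}$$

where 

$$q(\eta) = \lim_{n \to \infty} \left[\frac{n^{1-\eta}}{1-\eta} - \sum_{i=1}^{n} \frac{1}{i^{\eta}}\right],$$

e.g., $q(0) = 0$ and $q(1/2) = 1.46$. Table I shows the behavior of the 
statistics for the special cases $\eta = 0$ and $\eta = 1$, where (18) and 
(19) are not applicable. 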
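
The two quoted values of $q(\eta)$ can be checked numerically. The sketch below assumes the limit definition $q(\eta) = \lim_{n \to \infty} [n^{1-\eta}/(1-\eta) - \sum_{i=1}^{n} i^{-\eta}]$ and simply evaluates the bracket at a large but finite $n$; the function name and the default $n$ are choices made here for illustration.

```python
import math


def q(eta, n=200_000):
    """Finite-n estimate of q(eta) = lim [n^(1-eta)/(1-eta) - sum i^(-eta)]."""
    if not 0.0 <= eta < 1.0:
        raise ValueError("this sketch covers only 0 <= eta < 1")
    # fsum keeps the rounding error of the long partial sum small.
    partial_sum = math.fsum(i ** -eta for i in range(1, n + 1))
    return n ** (1.0 - eta) / (1.0 - eta) - partial_sum


print(q(0.0))  # the bracket vanishes identically for eta = 0
print(q(0.5))  # approaches the quoted 1.46 as n grows
```

The finite-$n$ estimate approaches the limit slowly, with an error on the order of $n^{-\eta}$, so a large $n$ is needed for $\eta$ near zero.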


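Table I. Behavior of statistics for $\eta = 0$ and $\eta = 1$ 

$\eta$    $c = \alpha A'(\hat{r})$    $\bar{\delta}_{n+1} / \bar{\delta}_1$    $\sigma_{n+1}^2 / (\sigma_\nu^2 / [A'(\hat{r})]^2)$ 
0    $0 < c < 2$    $(1 - c)^n < e^{-cn}$    $\frac{c}{2 - c}\left[1 - (1 - c)^{2n}\right]$ 
1    $c > 1/2$    $\le e^{-0.577c} / n^{c}$    $\frac{c^2}{2c - 1} \cdot \frac{1}{n}$ 
1    $1$    $\le e^{-0.577} / n$    $1/n$ 
1    $1/2$    $\le e^{-0.577/2} / \sqrt{n}$    $\ln(n) / (4n)$ 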
It is noted from eqs. (17) through (19) and Table I that the choice of 
a fixed step size ($\eta = 0$) gives the fastest convergence of the mean, 
$\bar{\delta}_{n+1}$, but results in a finite variance, $\sigma_{n+1}^2$, 
as $n \to \infty$. On the other hand, a variable, progressively smaller 
step size ($\eta \ne 0$) gives a slower convergence of the mean, but 
results in a variance approaching zero as $n \to \infty$. Thus, a fixed 
step size should be used if the measurement error variance, 
$\sigma_\nu^2$, is very small, and a variable step size should be used if 
the variance is large. Actually, when finite-precision arithmetic is 
employed, the choice of a progressively smaller step size can lead to 
large errors.’? Thus, in our particular application where the 
measurement error can be made small, and where the use of 
finite-precision arithmetic is necessary, a fixed step size is more 
suitable. 
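
The trade-off between fixed and shrinking step sizes can be illustrated by iterating (14) and (15) directly. The sketch below assumes that $A'(\xi_i)$ is constant and equal to $A'(\hat{r})$, so that $\gamma_i = c / i^{\eta}$; this linearization and the function name are assumptions made here, not part of the analysis above.

```python
def step_statistics(c, eta, n):
    """Mean factor and normalized variance after n steps, gamma_i = c / i**eta.

    Returns (E{delta_{n+1}} / E{delta_1},
             sigma_{n+1}^2 / (sigma_nu^2 / A'(r_hat)^2)),
    assuming A'(xi_i) is constant, which linearizes the recursion.
    """
    mean_factor, variance = 1.0, 0.0
    for i in range(1, n + 1):
        gamma = c / i ** eta
        mean_factor *= 1.0 - gamma                              # product in (14)
        variance = (1.0 - gamma) ** 2 * variance + gamma ** 2   # sum in (15)
    return mean_factor, variance


fixed = step_statistics(c=0.5, eta=0.0, n=50)
shrinking = step_statistics(c=0.5, eta=1.0, n=50)
print("fixed step:     mean factor %.3e, variance %.4f" % fixed)
print("shrinking step: mean factor %.3e, variance %.4f" % shrinking)
```

Under this assumption the fixed step drives the mean error down much faster, while the shrinking step ends with the smaller variance, in line with the discussion above.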

We note that our algorithm, (9), is based on the polar representation 
of the data symbols, while the hardware implementation of Fig. 1 is 
based on the rectangular representation. Thus, the linearizing 
processor is required to convert back and forth between the two 
representations. One could avoid this conversion by replacing (9) by 
the rectangular-based algorithm 

$$x_{n+1} + i y_{n+1} = x_n + i y_n - (\alpha_n + i\beta_n)\left[u_n + i v_n - G(a + ib)\right].$$

Unfortunately, however, the convergence of this algorithm is not 
guaranteed, even for some well-behaved amplifier characteristics. 


IV. SIMULATIONS 


Here we present the results of computer simulations of the algo- 
rithm, (9), with fixed step sizes and finite-precision arithmetic. The 
precision is limited by the finite number of bits of the D/A and A/D 
converters (Fig. 1). As an example, we consider a TWT amplifier with 
normalized amplitude and phase nonlinearities of the form” 





$$A(r) = \frac{2r}{1 + r^2}, \tag{20a}$$

$$\Phi(r) = 60^{\circ}\, \frac{r^2}{1 + r^2}, \tag{20b}$$


which are sketched in Fig. 3. Note that at saturation, $r = 1$, 
$A(r) = 1$, and $\Phi(r) = 30^{\circ}$. Figure 4 shows the severe 
distortion of the output constellation obtained with such an amplifier 
for a 64-QAM input signal driven with its corner point at saturation. 

Figures 5, 6, and 7 show the simulated results of the output signal 
constellations after the application of (9) for three different cases 
(explained below). In all cases the amplifier drive is maintained at 
saturation. The four quadrants in each figure represent different 
combinations of the numbers of bits of the D/A and A/D converters, 
as indicated in Table II. 
Fig. 4. The distortion of the signal constellation obtained with the 
amplifier of Fig. 3 for a 64-QAM input signal with its corner points at 
saturation. 


Figure 5 corresponds to the case of no measurement error, i.e., 
$\nu_n = \mu_n = 0$ in (6). The step sizes used in (9) are 
$\alpha_n = 0.5$ and $\beta_n = 1.0$. Since the normalized small-signal 
gain, $A'(r)$, of the amplifier is about 2, the choice of 
$\alpha_n = 0.5$ results in a value for $c$ of about 1. This results in 
the fastest convergence of (9a), as can be seen from the first row of 
Table I. Similarly, the choice of $\beta_n = 1.0$ results in the fastest 
convergence of (9b) since that equation is linear in $\theta$. The 
initial guesses of $r$ and $\theta$ for each constellation point in 
Fig. 5 were chosen at random, i.e., from a blind start. The results 
indicated in the figure are those after 25 iterations for each 
constellation point. The point scatter shown is entirely due to the 
finite resolution of the D/A and A/D converters since no measurement 
errors were assumed. 

It is clear from Fig. 5 that the use of 8 bits for both the D/A and 
A/D converters (fourth quadrant) gives quite an acceptable performance 
for 64 QAM. The use of 9 bits for each converter (first quadrant) 
results in an almost perfect output constellation. In general, for 
$L^2$-QAM, where $L = 2^M$, D/A and A/D converters with $(M + 5)$ bits 
are needed for acceptable performance, and $(M + 6)$ or more bits are 
needed for almost perfect performance. 

Fig. 5. Simulated results of the output signal constellation after 25 
iterations with no measurement errors, and for step sizes 
$\alpha_n = 0.5$ and $\beta_n = 1.0$. The numbers of bits of the D/A and 
A/D converters are different in each quadrant, as indicated in Table II. 
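
The converter word-length rule quoted above is easy to tabulate. The helper below is a minimal sketch; its name is hypothetical, and it does nothing more than apply the $(M + 5)$ and $(M + 6)$ rule for $L = 2^M$.

```python
import math


def converter_bits(levels_per_axis):
    """Return (acceptable, near_perfect) converter bits for L^2-QAM.

    Applies the rule quoted in the text: M + 5 and M + 6 bits, with L = 2**M.
    """
    m = math.log2(levels_per_axis)
    if not m.is_integer():
        raise ValueError("the rule is stated only for L equal to a power of two")
    return int(m) + 5, int(m) + 6


print(converter_bits(8))   # 64-QAM
print(converter_bits(16))  # 256-QAM
```

For 64-QAM this reproduces the 8-bit and 9-bit cases discussed for Fig. 5.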

In Fig. 6, the step sizes employed are $\alpha_n = 0.5$ and 
$\beta_n = 1.0$, as in Fig. 5. However, a measurement error was 
introduced that is equivalent to a 30-dB signal-to-noise ratio at the 
corner points of the constellation. With the normalization used for 
$A(r)$ in (20a), this noise corresponds to 
$\sigma_\nu^2 = A^2(r)\sigma_\mu^2 = 0.0005$, where $\sigma_\nu^2$ and 
$\sigma_\mu^2$ are the variances of the errors $\nu_n$ and $\mu_n$ 
defined in (6). It is clear from Fig. 6 that such a level of noise, in 
combination with the large step sizes employed, gives unacceptable 
results. 
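
One way to read the quoted variance is sketched below. It assumes that the total noise power implied by the signal-to-noise ratio splits equally between the radial error and the tangential error (amplitude times angular error), and that the corner-point output amplitude is 1; both assumptions and the function name are introduced here for illustration only.

```python
import math


def noise_variances(snr_db, corner_amplitude=1.0, amplitude_at_point=1.0):
    """Return (radial variance, angular variance) for a given corner-point SNR.

    Assumes equal noise power in the radial and tangential directions.
    """
    total_noise_power = corner_amplitude ** 2 / 10 ** (snr_db / 10)
    radial_variance = total_noise_power / 2
    # The tangential error is A(r) times the angular error.
    angular_variance = radial_variance / amplitude_at_point ** 2
    return radial_variance, angular_variance


sigma_nu_sq, sigma_mu_sq = noise_variances(30.0)
print(sigma_nu_sq, sigma_mu_sq)
```

Under these assumptions a 30-dB ratio gives a radial variance of 0.0005, matching the figure quoted above.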

In Fig. 7, the same noise level as that of Fig. 6 is employed. However, 
the step sizes were reduced by a factor of 10, i.e., $\alpha_n = 0.05$ 
and $\beta_n = 0.1$, resulting in a greatly improved performance. About 
100 steps of iteration were needed in this case to reach convergence. 
Note that in spite of the measurement noise, the performance in the 
third quadrant of Fig. 7 is better than the corresponding performance 
in Fig. 5, where no noise is present. This is due to the reduced step 
sizes, and to the fact that the noise tends to smooth out quantization 
errors. 

Fig. 6. Same as Fig. 5, but with a measurement error equivalent to 
30 dB of signal-to-noise ratio at the corner points of the 
constellation. 


Fig. 7. Same as Fig. 6, but with the step sizes reduced by a factor of 
10, i.e., $\alpha_n = 0.05$ and $\beta_n = 0.1$. About 100 iteration steps 
were needed for convergence. 


Table II. Number of bits of the D/A and A/D converters for the four 
quadrants of Figs. 5, 6, and 7 

Quadrant    Number of Bits of      Number of Bits of 
Number      the D/A Converter      the A/D Converter 
1           9                      9 
2           9                      7 
3           8                      6 
4           8                      8 


V. SUMMARY AND CONCLUSIONS 


We have proposed and analyzed a transmitter-based, adaptive 
linearizer, which automatically compensates for power amplifier 
nonlinearity in digital radio systems employing multilevel quadrature 
amplitude modulation. The linearizer utilizes a real-time, 
data-directed recursive algorithm for predistorting the signal 
constellation. The algorithm is robust and results in rapid 
convergence, even from a blind start. The digital nature of the 
linearizer allows both the signal constellation and the average 
transmitted power to be changed through software. 

The proposed linearizer is only suitable for amplifiers with 
essentially memoryless and single-valued nonlinearities. Consequently, 
pulse shaping is not permitted before amplification. In our approach 
the task of pulse shaping is relegated to the combination of the 
transmit RF filter, a receiver-based filter, and an automatic equalizer. 
The latter is presumed to be required anyway to deal with multipath 
fading, especially for signals having a large number of levels. 

Our principal conclusion is that high-power RF amplifiers can op- 
erate at saturation, where they are most efficient, provided that an 
adaptive predistortion linearizer is used. We have demonstrated that 
such a system is feasible and have provided a theory and a methodol- 
ogy for assessing its performance. 
In this paper we present several of the salient theoretical and 
practical issues associated with modeling a speech signal as a prob- 
abilistic function of a (hidden) Markov chain. First we give a concise 
review of the literature with emphasis on the Baum-Welch algorithm. 
This is followed by a detailed discussion of three issues not treated in 
the literature: alternatives to the Baum-Welch algorithm; critical 
facets of the implementation of the algorithms, with emphasis on 
their numerical properties; and behavior of Markov models on certain 
artificial but realistic problems. Special attention is given to a 
particular class of Markov models, which we call “left-to-right” models. 
This class of models is especially appropriate for isolated word 
recognition. The results of the application of these methods to an 
isolated word, speaker-independent speech recognition experiment 
are given in a companion paper. 


I. INTRODUCTION 


It is generally agreed that information in the speech signal is encoded 
in the temporal variation of its short-duration power spectrum. To 
decode the signal, then, requires techniques for both estimation of 
power spectra and tracking their changes in time. This paper is 
concerned with the application of the theory of probabilistic functions 
of a (hidden) Markov chain to modeling the inherent nonstationarity 
of the speech signal for the purposes of automatic speech recognition 
(ASR). 

The use of hidden Markov models for ASR was proposed by Baker’” 
and, independently, by a group at IBM.*” The theory on which their 
work rests is due to Baum et al.’*""’ Its first appearance in the literature 
occurred several years before Baker’s studies and has since been
explored in some detail.’*’® Our previous work in ASR has used
temporal alignment procedures based on dynamic programming tech- 
niques,” and we hoped that through studying the new (to us) body of 
material we could improve the performance and/or capabilities of our 
present ASR systems. 

Our initial goal, therefore, was to understand the theory of hidden 
Markov models sufficiently well to enable us to implement a new ASR 
system that could be compared directly to our existing ones. We have, 
in fact, been able to accomplish that goal, and a description and the 
results of our experiments are reported in a companion paper.” In the 
course of our studies, we have collected and integrated a number of 
loosely related mathematical techniques pertinent to Markov model- 
ing. We have also modified and adapted these techniques to the 
specific ASR problems we wished to study. Our purpose in writing this 
tutorial, then, is to present this synthesis in a way that will be 
enlightening to those not familiar with the theory of hidden Markov 
models. We also hope that this treatment will provide for a better 
understanding of our companion paper. Finally, we hope to make the 
presentation general enough so that the theory is seen to be applicable 
to more than the problem of ASR. 

We shall proceed as follows. We begin by defining probabilistic 
functions of a (hidden) Markov” chain and then show how they may 
be used in a natural way to model the speech signal. Once this is done, 
our task is reduced to solving two specific and well-defined mathemat- 
ical problems: (i) computing the parameters of a proposed model 
conditioned on a sequence of observations assumed to have been 
generated by the model, and (ii) calculating the probability that a 
given set of observations was produced by a particular model. 

First we review the solution to these problems as originally given by 
Baum,” who treated them as problems in statistical estimation. Lest
the problems be too narrowly construed, we look at them as problems 
of classical constrained optimization. This allows us to give a partial 
geometrical interpretation to the Baum-Welch algorithm and to relate 
it to other studies of the problem by Baum and Eagon™ and Baum 
and Sell.’’ It also makes clear that there are other methods of solution 
available that may, in certain instances, have advantages over the 
Baum-Welch algorithm. Finally, we discuss the dynamic programming 
algorithm of Viterbi” as an alternative to the so-called “forward- 
backward” method of Baum” for computing the probability of a 
sequence of observations conditioned on a specific model. 

The treatment of these problems in the existing literature, and as 
recounted here, can lull a prospective user of the theory into a false 
sense of security. The equations look innocuous enough but, in reality, 
they overlook two problems that, though uninteresting from a theoretical
standpoint, are of great significance for a robust implementation.
We believe it is worthwhile to address, first, a numerical problem
arising from the evaluation of certain frequently occurring algebraic
expressions, and then an experimental difficulty precipitated by the
inescapable reality of a finite training-data set.

The numerical problem arises because, regardless of the method of 
solution chosen, one is required to evaluate a product of stochastic 
matrices involving a number of factors proportional to the number of 
observations. In any real computer, this will ultimately result in 
underflow. Fortunately, the computation can be scaled using a tech- 
nique that subsequently will be seen to have some very useful prop- 
erties. 

The problem of insufficient training data can be ameliorated by 
changing the constraints on the optimization problem. This can be 
simply and directly accomplished in the classical methods. We show 
that the Baum-Welch algorithm, too, can be modified to produce the 
same result. Both of these methods appear to be simpler in implemen- 
tation than the technique proposed by Jelinek and Mercer.® 

Finally, under the heading of implementational considerations, we 
discuss techniques for model averaging. These can be used both for 
block processing of observations in case one is subject to storage 
limitations, and for increasing model stability under some circum- 
stances. 

The speech recognition experiment that we had in mind was on a 
speaker-independent, isolated word recognition system with a small 
vocabulary. Oddly enough, this is a simpler task than those to which 
the theory had already been applied by Baker’ and the IBM group.*” 
Perhaps our choice of a manageable problem is responsible for the 
degree of success reported in the companion paper.” We determined 
that for our ASR task it is advantageous to use a particular kind of 
hidden Markov model, which we call a left-to-right model. In such a 
model there are strong temporal constraints on the Markov chain. 
First, any state, once left, cannot be later revisited. Second, there is a 
final absorbing state in which all observation sequences are assumed 
to terminate. These restrictions on the sequences affect both parame- 
ter estimation and probability computation procedures. We show how 
the Baum-Welch algorithm, the Viterbi algorithm, and the classical 
methods can all be adapted for use with left-to-right models. 

We conclude our presentation with several sample solutions to some 
artificial but nontrivial problems that illustrate concepts treated in the 
foregoing discussion. 


II. A REVIEW OF THE THEORY

A probabilistic function of a (hidden) Markov chain is a stochastic
process generated by two interrelated mechanisms, an underlying
Markov chain having a finite number of states, and a set of random 
functions, one of which is associated with each state. At discrete 
instants of time, the process is assumed to be in some state and an 
observation is generated by the random function corresponding to the 
current state. The underlying Markov chain then changes states ac- 
cording to its transition probability matrix. The observer sees only the 
output of the random functions associated with each state and cannot 
directly observe the states of the underlying Markov chain; hence the 
term hidden Markov model. 

In principle, the underlying Markov chain may be of any order and 
the outputs from its states may be multivariate random processes 
having some continuous joint probability density function. In this 
discussion, however, we shall restrict ourselves to consideration of 
Markov chains of order one, i.e., those for which the probability of 
transition to any state depends only upon that state and its predeces- 
sor. We shall also limit the discussion to processes whose observations 
are drawn from a discrete finite alphabet according to discrete proba- 
bility distribution functions associated with the states. 

It is quite natural to think of the speech signal as being generated 
by such a process. We can imagine the vocal tract as being in one of a 
finite number of articulatory configurations or states. In each state a 
short (in time) signal is produced that has one of a finite number of 
prototypical spectra depending, of course, on the state. Thus, the 
power spectra of short intervals of the speech signal are determined 
solely by the current state of the model, while the variation of the 
spectral composition of the signal with time is governed predominantly 
by the probabilistic state transition law of the underlying Markov 
chain. For speech signals derived from a small vocabulary of isolated 
words, the model is reasonably faithful. The foregoing is, of course, an 
oversimplification intended only for the purpose of motivating the 
following theoretical discussion. 

Let us say that the underlying Markov chain has $N$ states $q_1, q_2,
\ldots, q_N$ and the observations are drawn from an alphabet, $V$, of $M$
prototypical spectra, $v_1, v_2, \ldots, v_M$. The underlying Markov chain can
then be specified in terms of an initial state distribution vector
$\pi' = (\pi_1, \pi_2, \ldots, \pi_N)$ and a state transition matrix,
$A = [a_{ij}]$, $1 \le i, j \le N$. Here, $\pi_i$ is the probability of $q_i$
at some arbitrary time, $t = 0$, and $a_{ij}$ is the probability of
transiting to state $q_j$ given current state, $q_i$, that is
$a_{ij} = \operatorname{prob}(q_j \text{ at } t + 1 \mid q_i \text{ at } t)$.

The random processes associated with the states can be collectively
represented by another stochastic matrix $B = [b_{jk}]$ in which for
$1 \le j \le N$ and $1 \le k \le M$, $b_{jk}$ is the probability of observing
symbol $v_k$ given current state $q_j$. We denote this as
$b_{jk} = \operatorname{prob}(v_k \text{ at } t \mid q_j \text{ at } t)$. Thus a
hidden Markov model, $M$, is identified with the parameter set
$(\pi, A, B)$.
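
As an illustration of the generative mechanism just defined, the following minimal Python sketch draws a state path and an observation sequence from a model $(\pi, A, B)$. The function name `sample_hmm`, the 0-based indices, and all numerical values are assumptions made for the example and do not come from the paper. The initial state is assumed to emit the first observation.

```python
import random

# Illustrative two-state, three-symbol model (all values are assumptions).
pi = [0.6, 0.4]                           # initial state distribution
A = [[0.7, 0.3], [0.4, 0.6]]              # A[i][j] = prob(q_j at t+1 | q_i at t)
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]    # B[j][k] = prob(v_k at t | q_j at t)

def sample_hmm(pi, A, B, T, rng):
    """Return (states, observations), each of length T, using 0-based indices."""
    states, obs = [], []
    q = rng.choices(range(len(pi)), weights=pi)[0]      # draw the initial state
    for _ in range(T):
        states.append(q)
        # The observer sees only this symbol, never the state q itself.
        obs.append(rng.choices(range(len(B[q])), weights=B[q])[0])
        q = rng.choices(range(len(A[q])), weights=A[q])[0]  # state transition
    return states, obs

states, obs = sample_hmm(pi, A, B, 10, random.Random(0))
```

Only `obs` would be available to a recognizer; `states` is returned here purely so the hidden path can be inspected.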

To use hidden Markov models to perform speech recognition we 
must solve two specific problems: observation sequence probability 
estimation, which will be used for classification of an utterance; and 
model parameter estimation, which will serve as a procedure for 
training models for each vocabulary word. Both problems proceed
from a sequence, $O$, of observations $O_1 O_2 \cdots O_T$ where each $O_t$
for $1 \le t \le T$ is some $v_k \in V$.

Our particular classification problem is as follows. We wish to
recognize utterances known to be selected from some vocabulary, $W$,
of words $w_1, w_2, \ldots, w_V$. We are given an observation sequence, $O$,
derived from the utterance of some unknown $w_i \in W$ and a set of $V$
models $M_1, M_2, \ldots, M_V$. We must compute
$P_i = \operatorname{prob}(O \mid M_i)$ for $1 \le i \le V$. We will then
classify the unknown utterance as $w_i$ iff $P_i \ge P_j$ for
$1 \le j \le V$.

The training problem is simply that of determining the models
$M_i = (\pi_i, A_i, B_i)$ for $1 \le i \le V$ given training sequences
$O^{(1)}, O^{(2)}, \ldots, O^{(V)}$, where $O^{(i)}$ is known to have been
derived from an utterance of word $w_i$ for $1 \le i \le V$.

One could, in principle, compute $\operatorname{prob}(O \mid M)$ by computing
the joint probability $\operatorname{prob}(O, s \mid M)$ for each state
sequence, $s$, of length $T$, and summing over all state sequences.
Obviously this is computationally intractable. Fortunately, however,
there is an efficient method for computing $P$. Let us define the
function $\alpha_t(i)$ for $1 \le t \le T$ as
$\operatorname{prob}(O_1 O_2 \cdots O_t \text{ and } q_i \text{ at } t \mid M)$.
According to the definition $\alpha_1(i) = \pi_i b_i(O_1)$, where
$b_i(O_1)$ is understood to mean $b_{ik}$ iff $O_1 = v_k$; then we have the
following recursive relationship for the “forward probabilities”:

$$\alpha_{t+1}(j) = \Big[\sum_{i=1}^{N} \alpha_t(i)\, a_{ij}\Big] b_j(O_{t+1}), \qquad 1 \le t \le T-1. \tag{1}$$

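Similarly, we define another function,
$\beta_t(i) = \operatorname{prob}(O_{t+1} O_{t+2} \cdots O_T \mid q_i \text{ at } t \text{ and } M)$.
We set $\beta_T(j) = 1\ \forall\, j$ and then use the backward recursion

$$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j), \qquad T-1 \ge t \ge 1 \tag{2}$$

to compute the “backward probabilities.”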
The two functions can be used to compute $P$ according to

$$P = \operatorname{Prob}(O \mid M) = \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j) \tag{3}$$

for any $t$ such that $1 \le t \le T - 1$. Equations (1) to (3) are from Baum™
and are sometimes referred to as the “forward-backward” algorithm.
Setting $t = T - 1$ in (3) gives

$$P = \sum_{i=1}^{N} \alpha_T(i) \tag{4}$$

so that $P$ can be computed from the forward probabilities alone. A
similar formula for $P$ can be obtained from the backward probabilities
by setting $t = 1$. These and several other formulas in this section may
be compactly written in matrix notation (see Appendix A). For instance,

$$P = \pi' B_1 A B_2 A \cdots A B_T \mathbf{1}, \tag{5}$$

where $\mathbf{1}$ is the $N$-vector $(1, 1, 1, \ldots, 1)'$ and

$$B_t = \begin{pmatrix} b_1(O_t) & & 0 \\ & \ddots & \\ 0 & & b_N(O_t) \end{pmatrix} \tag{6}$$

for $1 \le t \le T$. From (5) it is clear that $P$ is a homogeneous polynomial
in the $\pi_i$, $a_{ij}$, and $b_{jk}$. Any of eqs. (3) through (5) may be used
to solve the classification problem. The forward and backward probabilities
will prove to be convenient in other contexts.
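
The recursions (1) and (2) and the identities (3) and (4) can be checked numerically on a toy model. The sketch below is a minimal pure-Python illustration, not the authors' implementation: the model values, the observation sequence, and every function name are assumptions, and time and symbol indices are 0-based. A brute-force sum over all $N^T$ state sequences is included only as a reference and is feasible only for very small $T$.

```python
from itertools import product

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
O = [0, 1, 2, 1]          # 0-based symbol indices, T = 4

def forward(pi, A, B, O):
    """alpha[t][i] for 0-based t, following recursion (1)."""
    N = len(pi)
    alpha = [[pi[i] * B[i][O[0]] for i in range(N)]]
    for t in range(1, len(O)):
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(N)) * B[j][O[t]]
                      for j in range(N)])
    return alpha

def backward(A, B, O):
    """beta[t][i] for 0-based t, following recursion (2) from beta_T = 1."""
    N = len(A)
    beta = [[1.0] * N]
    for t in range(len(O) - 2, -1, -1):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][O[t + 1]] * nxt[j] for j in range(N))
                        for i in range(N)])
    return beta

def prob_eq3(alpha, beta, A, B, O, t):
    """Equation (3) evaluated at 0-based time t, 0 <= t <= T-2."""
    N = len(A)
    return sum(alpha[t][i] * A[i][j] * B[j][O[t + 1]] * beta[t + 1][j]
               for i in range(N) for j in range(N))

def prob_brute_force(pi, A, B, O):
    """Sum the joint probability over all N**T state sequences."""
    total = 0.0
    for s in product(range(len(pi)), repeat=len(O)):
        p = pi[s[0]] * B[s[0]][O[0]]
        for t in range(1, len(O)):
            p *= A[s[t - 1]][s[t]] * B[s[t]][O[t]]
        total += p
    return total

alpha, beta = forward(pi, A, B, O), backward(A, B, O)
P_forward = sum(alpha[-1])                                   # equation (4)
P_backward = sum(pi[i] * B[i][O[0]] * beta[0][i] for i in range(len(pi)))
P_eq3 = [prob_eq3(alpha, beta, A, B, O, t) for t in range(len(O) - 1)]
P_brute = prob_brute_force(pi, A, B, O)
```

All of `P_forward`, `P_backward`, and the entries of `P_eq3` should agree with `P_brute` up to rounding, which is the content of the statement that (3) holds for any admissible $t$.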

When we compute $P$ with the forward-backward algorithm, we are
including the probabilities of all possible state sequences that may
have generated $O$. Alternatively, we may define $P$ as the maximum
over all state sequences $i = i_1, i_2, \ldots, i_T$ of the joint probability
$P(O, i)$. This distinguished state sequence and the corresponding
probability of the observation sequence can be simultaneously computed
by means of the Viterbi” algorithm. This dynamic programming
technique proceeds as follows: Let $\phi_1(i) = \pi_i b_i(O_1)$ for
$1 \le i \le N$. Then we can perform the following recursion for
$2 \le t \le T$ and $1 \le j \le N$:

$$\phi_t(j) = \max_{1 \le i \le N} [\phi_{t-1}(i)\, a_{ij}]\, b_j(O_t) \tag{7a}$$

and

$$\psi_t(j) = i^*, \tag{7b}$$

where $i^*$ is a choice of an index $i$ that maximizes the bracketed term
$\phi_{t-1}(i)\, a_{ij}$ in (7a).

The result is that $P = \max_{1 \le i \le N} [\phi_T(i)]$. Also the maximum
likelihood state sequence can be recovered from $\psi$ as follows. Let
$q_T = i^*$, where $i^*$ maximizes $\phi_T(i)$. Then for $T \ge t \ge 2$,
$q_{t-1} = \psi_t(q_t)$. If one only wishes to compute $P$, the linked list,
$\psi$, need not be maintained as in (7b). Only the recursion (7a) is required.
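
The recursion (7a), the back-pointer list (7b), and the backtracking step can be sketched as follows. This is a minimal illustration under assumed model values; the names `viterbi` and `joint` are hypothetical, indices are 0-based, and ties in the maximization are broken by taking the first maximizing index.

```python
from itertools import product

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
O = [0, 1, 2, 1]

def joint(pi, A, B, O, s):
    """Joint probability of observation sequence O and state sequence s."""
    p = pi[s[0]] * B[s[0]][O[0]]
    for t in range(1, len(O)):
        p *= A[s[t - 1]][s[t]] * B[s[t]][O[t]]
    return p

def viterbi(pi, A, B, O):
    """Return (max joint probability, maximizing state sequence)."""
    N = len(pi)
    phi = [pi[i] * B[i][O[0]] for i in range(N)]       # phi_1(i)
    psi = []                                           # back pointers, (7b)
    for t in range(1, len(O)):
        pointers, new_phi = [], []
        for j in range(N):
            i_star = max(range(N), key=lambda i: phi[i] * A[i][j])
            pointers.append(i_star)
            new_phi.append(phi[i_star] * A[i_star][j] * B[j][O[t]])  # (7a)
        psi.append(pointers)
        phi = new_phi
    q = max(range(N), key=lambda i: phi[i])            # q_T
    best = phi[q]
    path = [q]
    for pointers in reversed(psi):                     # q_{t-1} = psi_t(q_t)
        q = pointers[q]
        path.append(q)
    path.reverse()
    return best, path

best, path = viterbi(pi, A, B, O)
brute = max(joint(pi, A, B, O, s) for s in product(range(len(pi)), repeat=len(O)))
```

If only `best` is needed, the `psi` list can be dropped, as the text notes.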

The problem of training a model, unfortunately, does not have such 
a simple solution. In fact, given any finite observation sequence as 
training data, we cannot optimally train the model. We can, however,
choose $\pi$, $A$, and $B$ such that $\operatorname{prob}(O \mid M)$ is locally maximized. For an
asymptotic analysis of the training problem the reader should consult 
Baum and Petrie.” 

We can use the forward and backward probabilities to formulate a
solution to the problem of training by parameter estimation. Given
some estimates of the parameter values we can compute, for example,
that the expected number of transitions, $\gamma_{ij}$, from $q_i$ to $q_j$,
conditioned on the observation sequence is just

$$\gamma_{ij} = \frac{1}{P} \sum_{t=1}^{T-1} \alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j). \tag{8}$$

Then, the expected number of transitions, $\gamma_i$, out of $q_i$, given $O$, is

$$\gamma_i = \sum_{j=1}^{N} \gamma_{ij} = \frac{1}{P} \sum_{t=1}^{T-1} \alpha_t(i)\, \beta_t(i), \tag{9}$$

the last step of which is based on (2). The ratio $\gamma_{ij}/\gamma_i$ is then
an estimate of the probability of state $q_j$, given that the previous state
was $q_i$. This ratio may be taken as a new estimate, $\bar a_{ij}$, of $a_{ij}$.
That is,

$$\bar a_{ij} = \frac{\gamma_{ij}}{\gamma_i} = \frac{\sum_{t=1}^{T-1} \alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{\sum_{t=1}^{T-1} \alpha_t(i)\, \beta_t(i)}. \tag{10}$$

Similarly, we can make a new estimate of $b_{jk}$ as the frequency of
occurrence of $v_k$ in $q_j$ relative to the frequency of occurrence of any
symbol in state $q_j$. Stated in terms of the forward and backward
probabilities we have

$$\bar b_{jk} = \frac{\sum_{t \ni O_t = v_k} \alpha_t(j)\, \beta_t(j)}{\sum_{t=1}^{T} \alpha_t(j)\, \beta_t(j)}. \tag{11}$$

Finally, new values of the initial state probabilities may be obtained
from

$$\bar \pi_i = \frac{1}{P}\, \alpha_1(i)\, \beta_1(i). \tag{12}$$



As we shall see in the next section, the reestimates are guaranteed 
to increase P, except at a critical point. 
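
One pass of the reestimation formulas (10), (11), and (12) can be sketched in a few lines. The code below is an illustrative sketch under assumed model values and a single short training sequence; the function names are hypothetical, indices are 0-based, and no scaling is applied, so it is suitable only for short sequences.

```python
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
O = [0, 1, 2, 1, 0, 0, 2]

def forward(pi, A, B, O):
    N = len(pi)
    alpha = [[pi[i] * B[i][O[0]] for i in range(N)]]
    for t in range(1, len(O)):
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(N)) * B[j][O[t]]
                      for j in range(N)])
    return alpha

def backward(A, B, O):
    N = len(A)
    beta = [[1.0] * N]
    for t in range(len(O) - 2, -1, -1):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][O[t + 1]] * nxt[j] for j in range(N))
                        for i in range(N)])
    return beta

def reestimate(pi, A, B, O):
    """Apply (10), (11), (12) once; return (new_pi, new_A, new_B, P)."""
    N, M, T = len(pi), len(B[0]), len(O)
    alpha, beta = forward(pi, A, B, O), backward(A, B, O)
    P = sum(alpha[-1])
    new_A = [[sum(alpha[t][i] * A[i][j] * B[j][O[t + 1]] * beta[t + 1][j]
                  for t in range(T - 1))
              / sum(alpha[t][i] * beta[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]                    # (10)
    new_B = [[sum(alpha[t][j] * beta[t][j] for t in range(T) if O[t] == k)
              / sum(alpha[t][j] * beta[t][j] for t in range(T))
              for k in range(M)] for j in range(N)]                    # (11)
    new_pi = [alpha[0][i] * beta[0][i] / P for i in range(N)]          # (12)
    return new_pi, new_A, new_B, P

pi1, A1, B1, P0 = reestimate(pi, A, B, O)
P1 = sum(forward(pi1, A1, B1, O)[-1])
```

Comparing `P1` with `P0` shows the increase in $P$ asserted in the text, and the rows of `A1` and `B1` and the vector `pi1` each sum to one without any explicit normalization step.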


2.1 Proof of the reestimation formula 


The reestimation formulas (10), (11), and (12) are instances of the 
Baum-Welch algorithm. Although it is not at all obvious, each appli- 
cation of the formulas is guaranteed to increase P except if we are at 
a critical point of P, in which case the new estimates will be identical 


SPEECH RECOGNITION 1041 


to their current values. Several proofs of this rather surprising fact are 
given in the literature.*’” Because we shall need to modify it later to 
cope with the finite sample size problem, we shall briefly sketch 
Baum’s proof” here. The proof is based on the following two lemmas: 
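Lemma 1: Let $u_i$, $i = 1, \ldots, S$ be positive real numbers, and let $v_i$,
$i = 1, \ldots, S$ be nonnegative real numbers such that $\sum_i v_i > 0$. Then
from the concavity of the log function it follows that

$$\ln\left(\frac{\sum_i v_i}{\sum_i u_i}\right) = \ln\left[\sum_i \left(\frac{u_i}{\sum_k u_k}\right)\left(\frac{v_i}{u_i}\right)\right] \ge \sum_i \frac{u_i}{\sum_k u_k} \ln\left(\frac{v_i}{u_i}\right) = \frac{1}{\sum_k u_k}\left[\sum_i \left(u_i \ln v_i - u_i \ln u_i\right)\right]. \tag{13}$$

Here every summation is from 1 to $S$.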


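Lemma 2: If $c_i > 0$, $i = 1, \ldots, N$, then subject to the constraint
$\sum_i x_i = 1$, the function

$$F(x) = \sum_i c_i \ln x_i \tag{14}$$

attains its unique global maximum when

$$x_i = \frac{c_i}{\sum_i c_i}. \tag{15}$$

The proof follows from the observation that by the Lagrange method

$$\frac{\partial}{\partial x_i}\left[F(x) - \lambda \sum_i x_i\right] = \frac{c_i}{x_i} - \lambda = 0. \tag{16}$$

Multiplying by $x_i$ and summing over $i$ gives $\lambda = \sum_i c_i$, hence the result.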


Now in Lemma 1, let $S$ be the number of state sequences of length
$T$. For the $i$th sequence let $u_i$ be the joint probability

$$u_i = \operatorname{Prob}[\text{state sequence } i, \text{ observation } O \mid \text{model } M] = P(i, O \mid M).$$

Let $v_i$ be the same joint probability conditioned on model $\bar M$. Then

$$\sum_i u_i = p(O \mid M) \triangleq P(M), \qquad \sum_i v_i = p(O \mid \bar M) \triangleq P(\bar M) \tag{17}$$

and the lemma gives


1042 THE BELL SYSTEM TECHNICAL JOURNAL, APRIL 1983 


$$\ln \frac{P(\bar M)}{P(M)} \ge \frac{1}{P(M)} \left[Q(M, \bar M) - Q(M, M)\right], \tag{18}$$

where

$$Q(M, \bar M) \triangleq \sum_i u_i \ln v_i. \tag{19}$$


Thus, if we can find a model $\bar M$ that makes the right-hand side of (18)
positive, we have a way of improving the model $M$. Clearly, the largest
guaranteed improvement by this method results for $\bar M$, which maximizes
$Q(M, \bar M)$ [and hence maximizes the right-hand side of (18)]. The
remarkable fact proven in Ref. 13 is that $Q(M, \bar M)$ attains its maximum
when $\bar M$ is related to $M$ by the reestimation formulas (10) through (12).
To show this let the $s$th-state sequence be $s_0, s_1, \ldots, s_T$, and the
given observation sequence be $O_1, \ldots, O_T$. Then

$$\ln v_s = \ln p(s, O \mid \bar M) = \ln \bar\pi_{s_0} + \sum_{t=0}^{T-1} \ln \bar a_{s_t s_{t+1}} + \sum_{t=0}^{T-1} \ln \bar b_{s_{t+1}}(O_{t+1}). \tag{20}$$

Substituting this in (19) for $Q(M, \bar M)$, and regrouping terms in the
summations according to state transitions and observed symbols, it
can be seen that

$$Q(M, \bar M) = \sum_{i=1}^{N} \sum_{j=1}^{N} c_{ij} \ln \bar a_{ij} + \sum_{j=1}^{N} \sum_{k=1}^{M} d_{jk} \ln \bar b_j(k) + \sum_{i=1}^{N} e_i \ln \bar\pi_i. \tag{21}$$

Here

$$c_{ij} = \sum_{s=1}^{S} p(s, O \mid M)\, n_{ij}(s) \tag{22a}$$

$$d_{jk} = \sum_{s=1}^{S} p(s, O \mid M)\, m_{jk}(s) \tag{22b}$$

$$e_i = \sum_{s=1}^{S} p(s, O \mid M)\, r_i(s), \tag{22c}$$

and for the $s$th-state sequence

$n_{ij}(s)$ = number of transitions from state $q_i$ to $q_j$
$m_{jk}(s)$ = number of times symbol $k$ is generated in state $q_j$
$r_i(s)$ = 1 if initial state is $q_i$
         = 0 otherwise.

Thus, $c_{ij}$, $d_{jk}$, and $e_i$ are the expected values of $n_{ij}$, $m_{jk}$, $r_i$, respectively,


based on model $M$.

The expression (21) is now a sum of $2N + 1$ independent expressions
of the type maximized in Lemma 2. Hence, $Q(M, \bar M)$ is maximized if

$$\bar a_{ij} = \frac{c_{ij}}{\sum_j c_{ij}} \tag{23a}$$

$$\bar b_j(k) = \frac{d_{jk}}{\sum_k d_{jk}} \tag{23b}$$

$$\bar\pi_i = \frac{e_i}{\sum_i e_i}. \tag{23c}$$


These are recognized as the reestimation formulas. 


2.2 Solution by optimization techniques 


Lest the reader be led to believe that the reestimation formulas are 
peculiar to stochastic processes, we shall examine them briefly from 
several different points of view. Note that the reestimation formulas 
update the model in such a way that the constraints 


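$$\sum_{i=1}^{N} \pi_i = 1 \tag{24a}$$

$$\sum_{j=1}^{N} a_{ij} = 1 \quad \text{for } 1 \le i \le N \tag{24b}$$

and

$$\sum_{k=1}^{M} b_{jk} = 1 \quad \text{for } 1 \le j \le N \tag{24c}$$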


are automatically satisfied at each iteration. The constraints are, of 
course, required to make the hidden Markov model well defined. It is 
thus natural to look at the training problem as a problem of constrained 
optimization of P and, at least formally, solve it by the classical method 
of Lagrange multipliers. For simplicity, we shall restrict the discussion 
to optimization with respect to $A$. Let $Q$ be the Lagrangian of $P$ with
respect to the constraints (24b). We see that

$$Q = P + \sum_{i=1}^{N} \lambda_i \Big(\sum_{j=1}^{N} a_{ij} - 1\Big), \tag{25}$$

where the $\lambda_i$ are the as yet undetermined Lagrange multipliers.
At a critical point of $P$ on the interior of the manifold defined by
(24a through c), it will be the case that for $1 \le i, j \le N$

$$\frac{\partial Q}{\partial a_{ij}} = \frac{\partial P}{\partial a_{ij}} + \lambda_i = 0. \tag{26}$$

Multiplying (26) by $a_{ij}$ and summing over $j$ we get

$$\sum_{j=1}^{N} a_{ij} \frac{\partial P}{\partial a_{ij}} = -\lambda_i \sum_{j=1}^{N} a_{ij} = -\lambda_i = \frac{\partial P}{\partial a_{ij}}, \tag{27}$$

where the right-hand side of (27) follows from substituting (24b) for
the sum of $a_{ij}$ and then replacing $\lambda_i$ according to (26). From (27) it
may be seen that $P$ is maximized when

$$a_{ij} = \frac{a_{ij}\, \partial P/\partial a_{ij}}{\sum_{k=1}^{N} a_{ik}\, \partial P/\partial a_{ik}}. \tag{28}$$

A similar argument can be made for the $\pi$ and $B$ parameters.

While it is true that solving (28) for $a_{ij}$ is analytically intractable, it
can be used to provide some useful insights into the Baum-Welch
reestimation formulas and alternatives to them for solving the training
problem. Let us begin by computing $\partial P/\partial a_{ij}$ by differentiating (3),
according to the formula for differentiating a product,

$$\frac{\partial P}{\partial a_{ij}} = \sum_{t=1}^{T-1} \alpha_t(i)\, b_j(O_{t+1})\, \beta_{t+1}(j). \tag{29}$$

Substituting the right-hand side of (29) for $\partial P/\partial a_{ij}$ in (28), we get

$$a_{ij} = \frac{\sum_{t=1}^{T-1} \alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{\sum_{j=1}^{N} \sum_{t=1}^{T-1} \alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}. \tag{30}$$

Then changing the order of summation in the denominator of (30) and
substituting in the right-hand side of (2) we get

$$a_{ij} = \frac{\sum_{t=1}^{T-1} \alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{\sum_{t=1}^{T-1} \alpha_t(i)\, \beta_t(i)}. \tag{31}$$

The right-hand side of (31) is thus seen to be identical to that of the 
reestimation formula (10). Thus, at a critical point, the reestimation 
formula (10) solves the equations (31). Similarly, if we differentiate (3)
with respect to $\pi_i$ and $b_{jk}$ we get

$$\frac{\partial P}{\partial \pi_i} = \sum_{j=1}^{N} b_i(O_1)\, a_{ij}\, b_j(O_2)\, \beta_2(j) = b_i(O_1)\, \beta_1(i) \tag{32}$$




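and

$$\frac{\partial P}{\partial b_{jk}} = \sum_{\substack{t \ge 2 \\ O_t = v_k}} \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, \beta_t(j) + \delta(O_1, v_k)\, \pi_j\, \beta_1(j), \tag{33}$$

respectively. In (33) $\delta$ is understood to be the Kronecker $\delta$ function.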
By substituting (32) and (33) into their respective analogs of (28), 
we obtain the reestimation formulas (12) and (11), respectively, at a 
critical point. Thus it appears that the reestimation formulas may 
have more general applications than might appear from their statistical 
motivation. 
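
The derivative formulas lend themselves to a numerical sanity check. The sketch below compares (29) with a central finite difference of $P$, treating the $a_{ij}$ as free variables, and checks (32) through the fact that $P$ is linear in the $\pi_i$. The model values, step size, and function names are assumptions chosen for illustration; indices are 0-based.

```python
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
O = [0, 1, 2, 1, 0]

def forward(pi, A, B, O):
    N = len(pi)
    alpha = [[pi[i] * B[i][O[0]] for i in range(N)]]
    for t in range(1, len(O)):
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(N)) * B[j][O[t]]
                      for j in range(N)])
    return alpha

def backward(A, B, O):
    N = len(A)
    beta = [[1.0] * N]
    for t in range(len(O) - 2, -1, -1):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][O[t + 1]] * nxt[j] for j in range(N))
                        for i in range(N)])
    return beta

def likelihood(pi, A, B, O):
    return sum(forward(pi, A, B, O)[-1])

def dP_dA(pi, A, B, O):
    """Analytic derivative, equation (29)."""
    N, T = len(pi), len(O)
    alpha, beta = forward(pi, A, B, O), backward(A, B, O)
    return [[sum(alpha[t][i] * B[j][O[t + 1]] * beta[t + 1][j]
                 for t in range(T - 1)) for j in range(N)] for i in range(N)]

def numeric_dP_dA(pi, A, B, O, i, j, h=1e-6):
    """Central difference in a_ij alone (the row constraint is ignored)."""
    up = [row[:] for row in A]
    dn = [row[:] for row in A]
    up[i][j] += h
    dn[i][j] -= h
    return (likelihood(pi, up, B, O) - likelihood(pi, dn, B, O)) / (2 * h)

analytic = dP_dA(pi, A, B, O)
numeric = [[numeric_dP_dA(pi, A, B, O, i, j) for j in range(2)] for i in range(2)]

# Equation (32): dP/dpi_i = b_i(O_1) * beta_1(i); P is linear in pi.
beta = backward(A, B, O)
dP_dpi = [B[i][O[0]] * beta[0][i] for i in range(2)]
P = likelihood(pi, A, B, O)
```

Because $P$ is linear in the $\pi_i$, the sum of `pi[i] * dP_dpi[i]` reproduces `P` exactly up to rounding.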
Equation (28) suggests that we define a transformation, $T$, of the
parameter space onto itself as

$$T(x)_{ij} = \frac{x_{ij}\, \partial P/\partial x_{ij}}{\sum_{k=1}^{N} x_{ik}\, \partial P/\partial x_{ik}}, \tag{34}$$

where $T(x)_{ij}$ is understood to mean the $i, j$th coordinate of the image
of $x$ under $T$. The parameter space is restricted to be the manifold
such that $x_{ij} \ge 0$ for $1 \le i, j \le N$ and $\sum_{j=1}^{N} x_{ij} = 1$ for
$1 \le i \le N$. Thus, the reestimation formulas (10), (11), and (12) are a
special case of the transformation (34), with $P$ a particular homogeneous
polynomial in the $x_{ij}$ having positive coefficients. Here the $x_{ij}$
include the $\pi_i$, the $a_{ij}$, and the $b_{jk}$. Baum and Eagon™ have shown
that for any such polynomial $P[T(x)] > P(x)$ except if $x$ is a critical
point of $P$. Thus the transformation, $T$, is appropriately called a growth
transformation. The conditions under which $T$ is a growth transformation
were relaxed by Baum and Sell?’ to include all polynomials with positive
coefficients. They further proved that $P$ increases monotonically on the
segment from $x$ to $T(x)$. Specifically, they showed that
$P[\eta T(x) + (1 - \eta)x] \ge P(x)$ for $0 \le \eta \le 1$. Other properties of
the transformation (34) have been explored by Passman™ and Stebe.’? There
may be still less restrictive general criteria on $P$ for $T$ to be a growth
transformation.
We can give $T(x)$ a simple geometric interpretation. For the purposes
of this discussion we shall restrict ourselves to $x \in R^N$, $x_i \ge 0$ for
$1 \le i \le N$, and the single constraint $G(x) = \sum_{i=1}^{N} x_i - 1 = 0$.
We do so without loss of generality, since constraints such as those of
(24a, b, and c) are disjoint, i.e., no pair of constraints has any common
variables. As shown in Fig. 1, given any $x$ satisfying $G(x) = 0$, $T(x)$ is
the intersection of the vector $X$, or its extension, with the hyperplane
$\sum_{i=1}^{N} x_i - 1 = 0$, where $X$ has components $x_i\, \partial P/\partial x_i$
for $1 \le i \le N$.

[Figure 1 not reproduced. Labeled quantities: the hyperplane
$G(x) = \sum_{i=1}^{n} x_i - 1 = 0$, the vectors $x + \eta X$, the direction of
$T(x) - x$, and the point $\eta T(x) + [1 - \eta]x$.]

Fig. 1. Geometrical relationship of the quantities involved in the
reestimation formulas.

This may be shown by observing that a line in the direction of $X$
passing through the origin has the equation $y = rX$, where $r$ is a
non-negative scalar. Component-wise this is equivalent to

$$y_i = r\, x_i \frac{\partial P}{\partial x_i} \quad \text{for } 1 \le i \le N. \tag{35}$$

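We can find that $r$ for which $y$ intersects the hyperplane $G(x) = 0$ by
summing over $i$. Thus

$$\sum_{i=1}^{N} y_i = r \sum_{i=1}^{N} x_i \frac{\partial P}{\partial x_i} = 1, \tag{36}$$

since $y$ lies on the hyperplane $G(x) = 0$. Rearranging (36) we have

$$r = \frac{1}{\sum_{i=1}^{N} x_i\, \partial P/\partial x_i} \tag{37}$$

and

$$y_i = \frac{x_i\, \partial P/\partial x_i}{\sum_{j=1}^{N} x_j\, \partial P/\partial x_j}. \tag{38}$$

Furthermore, as also shown in Fig. 1, the vector $[T(x) - x]$ is the set
of intersections of the vector $(x + \eta X)$ with the hyperplane $G(x) = 0$
for $0 \le \eta \le +\infty$ with $T(x)$ corresponding to $\eta = +\infty$ and $x$ to
$\eta = 0$.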
Finally, in view of the result of Baum and Sell, quoted above, the
vector $T(x) - x$ must also have a positive projection on $\nabla P$. This,
too, is easily seen. If $P$ is a polynomial with positive coefficients, then
$\partial P/\partial x_i \ge 0$ for $1 \le i \le N$. From the definition of $T$ it is
clear that

$$T(x)_i = \frac{x_i\, \partial P/\partial x_i}{\sum_{j=1}^{N} x_j\, \partial P/\partial x_j} = \frac{x_i\, \partial P/\partial x_i}{r}, \tag{39}$$

where $r$ is some constant. Then it must be true that

$$\sum_{i=1}^{N} [T(x)_i - x_i] \left(\frac{\partial P}{\partial x_i} - r\right) \ge 0 \tag{40}$$

since both factors in each summand are of the same sign. Rearranging
(40) we have

$$\sum_{i=1}^{N} [T(x)_i - x_i] \frac{\partial P}{\partial x_i} \ge r \sum_{i=1}^{N} [T(x)_i - x_i] = 0. \tag{41}$$

The right-hand side is zero since $\sum_{i=1}^{N} T(x)_i = \sum_{i=1}^{N} x_i = 1$. Thus
$[T(x) - x] \cdot \nabla P \ge 0$, proving that a step of the transformation has a
positive projection along the gradient of P.
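
The growth property and the projection inequality (41) can be observed on a small example. The sketch below uses a homogeneous cubic with positive coefficients on the simplex $x_1 + x_2 + x_3 = 1$; the polynomial, the starting point, and the function names are assumptions chosen only for illustration and are not an HMM likelihood.

```python
def P(x):
    """Homogeneous polynomial with positive coefficients (illustrative choice)."""
    x1, x2, x3 = x
    return x1 * x1 * x2 + 2 * x1 * x2 * x3 + x3 ** 3

def grad_P(x):
    x1, x2, x3 = x
    return [2 * x1 * x2 + 2 * x2 * x3,
            x1 * x1 + 2 * x1 * x3,
            2 * x1 * x2 + 3 * x3 * x3]

def T(x):
    """Single-constraint form of the transformation: x_i dP/dx_i, renormalized."""
    g = grad_P(x)
    r = sum(xi * gi for xi, gi in zip(x, g))
    return [xi * gi / r for xi, gi in zip(x, g)]

x0 = [0.2, 0.5, 0.3]
Tx0 = T(x0)

# Projection of the step on the gradient, as in (41).
projection = sum((t - xi) * gi for t, xi, gi in zip(Tx0, x0, grad_P(x0)))

# Values of P along the segment from x to T(x), eta in [0, 1].
segment = [P([eta * t + (1 - eta) * xi for t, xi in zip(Tx0, x0)])
           for eta in [k / 10 for k in range(11)]]

# Values of P under repeated application of T.
values, x = [P(x0)], x0
for _ in range(20):
    x = T(x)
    values.append(P(x))
```

In this example `values` is nondecreasing, every entry of `segment` is at least `P(x0)`, and `projection` is nonnegative, in line with the results quoted from Baum and Eagon and from Baum and Sell.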

This merely guarantees that we can move an infinitesimal amount 
in the direction of [T(x) — x] while increasing P. The theorem of 
Baum and Eagon, however, guarantees much more, namely that we 
can take a finite step and be assured of increasing P. We may, in fact, 
be able to continue past T(x) while still increasing P. We are unable, 
at present, to give a geometrical interpretation of this fact. 

While the reestimation formulas provide an elegant method for 
maximizing P, their success depends critically on the constraint set 
(24a, b, and c). As we will suggest later, in some cases there may be 
advantages in using classical optimization methods. 

The principle of the classical methods is to search along the projec- 
tion of VP on the constraint space, G, for a local maximum. The 
method of Rosen,” for example, uses only VP and a crude search 
strategy. The method of Davidon is one of many quasi-Newton tech- 
niques that uses the Fletcher-Powell” approximation to the inverse of 
the Hessian of P and an exact line search with adaptive step size. 
There are many collections of general purpose subroutines for con- 
strained optimization that can be used to solve the training problem. 
We have successfully used a version of the Davidon procedure from 
the Harwell Subroutine Library.” However, for the constraints that 
$\pi$, $A$, and $B$ be stochastic, the computation can be greatly simplified.

We illustrate this by outlining the gradient search algorithm for the
case where $P$ is a function of the variables $x_1, \ldots, x_N$ subject to the
constraints $x_i \ge 0$ for $1 \le i \le N$ and $\sum_{i=1}^{N} x_i - 1 = 0$. For
convenience we will call the last constraint $G_1$, and the inequality
constraints on $x_1, \ldots, x_N$ as $G_2, \ldots, G_{N+1}$, respectively.

An initial starting point $x$ is chosen and the “active” constraints
identified. For our case $G_1$ is always active. For $i > 1$, $G_i$ is active if
$x_{i-1} = 0$. Let $G_{n_j}$, $j = 1, \ldots, \ell$ be the active constraints (with
$n_1 = 1$) at the initial point. Let $Q = P + \sum_{j=1}^{\ell} \lambda_j G_{n_j}$.
Then according to the Kuhn-Tucker theorem,”’ the Lagrangian multipliers,
$\lambda_j$, are determined by demanding that $\nabla Q$ be orthogonal to
$\nabla G_{n_j}$ for $1 \le j \le \ell$. Now

$$\nabla Q = \nabla P + \sum_{j=1}^{\ell} \lambda_j \nabla G_{n_j} = \nabla P + \Gamma \lambda, \tag{42}$$

where $\Gamma$ is the $N \times \ell$ matrix with
$\Gamma_{ij} = (\nabla G_{n_j})_i = \partial G_{n_j}/\partial x_i$, and $\lambda$ is the
vector with components $\lambda_j$ for $j = 1, \ldots, \ell$. Thus the Kuhn-Tucker
requirement is equivalent to

$$\Gamma' \nabla Q = 0 \tag{43}$$

or, from (42),

$$\lambda = -(\Gamma' \Gamma)^{-1} \Gamma' \nabla P. \tag{44}$$

For our special constraints we have

$$\Gamma_{i1} = 1 \quad \text{for } 1 \le i \le N \tag{45}$$

and, for $j > 1$,

$$\Gamma_{ij} = \begin{cases} 1 & \text{if } i = n_j - 1 \\ 0 & \text{otherwise.} \end{cases} \tag{46}$$

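$$\Gamma' \Gamma = \begin{pmatrix} N & 1 & 1 & \cdots & 1 \\ 1 & 1 & 0 & \cdots & 0 \\ 1 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & & \ddots & \vdots \\ 1 & 0 & 0 & \cdots & 1 \end{pmatrix} \tag{47}$$

and $(\Gamma' \Gamma)^{-1}$ may be shown to be

$$(\Gamma' \Gamma)^{-1} = \frac{1}{N - \ell + 1} \begin{pmatrix} 1 & -1 & -1 & \cdots & -1 \\ -1 & N - \ell + 2 & 1 & \cdots & 1 \\ -1 & 1 & N - \ell + 2 & \cdots & 1 \\ \vdots & \vdots & & \ddots & \vdots \\ -1 & 1 & 1 & \cdots & N - \ell + 2 \end{pmatrix}. \tag{48}$$
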

Substituting (48) into (44) gives $\lambda$. When this $\lambda$ is substituted
back into (42), it turns out that the resulting vector $\nabla Q$ can be
computed by the following simple steps:

(i) Compute $\nabla P$ and let $S$ be the sum of all components of $\nabla P$
except $(\nabla P)_{n_j - 1}$, $j = 2, \ldots, \ell$.

(ii) Then

$$(\nabla Q)_i = \begin{cases} 0 & i = n_j - 1,\ j = 2, \ldots, \ell \\ (\nabla P)_i - \dfrac{S}{N - \ell + 1} & \text{otherwise.} \end{cases}$$

Finally, the values of $P$ are searched along the line

$$x(\eta) = x + \eta\, \frac{\nabla Q}{\lVert \nabla Q \rVert} \tag{49}$$

for a maximum with respect to $\eta$. The procedure is repeated at this
new point.
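
Steps (i) and (ii) amount to subtracting the mean of the free gradient components and zeroing the components held at the boundary. A minimal sketch follows, with 0-based indices; the function name `projected_gradient`, the tolerance argument, and the numerical example are assumptions, and the line search of (49) is omitted.

```python
def projected_gradient(grad, x, tol=0.0):
    """Search direction for the simplex constraint set, per steps (i)-(ii).

    grad : components of the gradient of P at x
    x    : current point, with sum(x) == 1 and x[i] >= 0
    A positivity constraint is treated as active when x[i] <= tol.
    """
    N = len(x)
    active = {i for i in range(N) if x[i] <= tol}
    ell = 1 + len(active)            # the sum constraint G_1 is always active
    # Step (i): sum of the gradient components that are not held at zero.
    S = sum(g for i, g in enumerate(grad) if i not in active)
    shift = S / (N - ell + 1)
    # Step (ii): zero on active coordinates, shifted gradient elsewhere.
    return [0.0 if i in active else grad[i] - shift for i in range(N)]

grad = [3.0, 1.0, 2.0, 5.0]
x = [0.5, 0.0, 0.5, 0.0]
direction = projected_gradient(grad, x)
```

The orthogonality requirement (43) can be read off the result: the components of `direction` sum to zero, so a step along it preserves the sum constraint, and the components at the active coordinates vanish.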

In applying this technique to the actual training problem, there will 
be $2N + 1$ stochasticity constraints analogous to $G_1$ and a corresponding
number of positivity constraints analogous to $G_2, G_3, \ldots, G_{N+1}$. In
this case we have the option of treating all the parameters and their 
associated constraints together, or we may divide them into disjoint 
subsets and determine search directions for each subset independently. 

Notice that this derivation does not require P to be of any special 
form. This may prove to be an advantage since the Baum-Welch 
algorithm is not applicable to all P. Furthermore, the constraints may 
be changed. Although, as we shall see later, the Baum-Welch algorithm 
can be somewhat generalized in this respect, it does not generalize to 
work with arbitrary linear constraints. 


III. CONSIDERATIONS FOR IMPLEMENTATION


From the foregoing discussion it might appear that solutions to the 
problems of hidden Markov modeling can be obtained by straightfor- 
ward translation of the relevant formulas into computer programs. 
Unfortunately, for all but the most trivial problems, the naive imple- 
mentation will not succeed for two principal reasons. First, any of the 
methods of solution presented here for either the classification or the 
training problem requires evaluation of $\alpha_t(i)$ and $\beta_t(i)$ for $1 \le t \le T$ and
$1 \le i \le N$. From the recursive formulas for these quantities, (1) and
(2), it is clear that as $T \to \infty$, $\alpha_T(i) \to 0$, and $\beta_1(i) \to 0$ in exponential
fashion. In practice, the number of observations necessary to ade- 
quately train a model and/or compute its probability will result in 
underflow on any real computer if (1) and (2) are evaluated directly. 
Fortunately, there is a method for scaling these computations that not 
only solves the underflow problem but also greatly simplifies several 
other calculations. 

The second problem is more serious, more subtle, and admits of a 
less gratifying, though still effective, solution. Baum and Petrie’ have 
shown that the maximum likelihood estimates of the parameters of a 
hidden Markov process are consistent estimates (converge to the true 
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values as T —> ©) of the parameters. The practical implication of the 
theorem is that, in training, one should use as many observations as 
possible which, as we have noted, make scaling necessary. In reality, 
of course, the observation sequence will always be finite. Then the 
following situation can arise. Suppose a given training sequence of 
length T results in b;, = 0. (It is, in fact, possible for a local maximum 
of P to lie on a boundary of the parameter manifold.) Suppose further 
that we are subsequently asked to compute the probability that a new 
observation sequence was generated by our model. Even if the new 
sequence was actually generated by the model, it can be such that 
a-1(t)ai; is nonzero for only one value of 7 and that O; = uz, whence 
aj) = 0 and the probability of the observation then becomes zero. 
This phenomenon is fatal to a classification task; yet, the smaller T is, 
the more likely is its occurrence. Jelinek and Mercer® have dealt with 
this problem in a slightly different context. Here, we offer the much 
simpler solution of constraining the parameter values so that
$x_{ij} \ge \epsilon_{ij} > 0$.

Finally, in this section we discuss the related problem of model 
stability. Baum and Eagon™ note that successive applications of the 
reestimation formulas converge to a connected component of the local 
maximum set of P. In case there are only a finite number of such 
extrema, the point of convergence is unique to within a renaming of 
the states. Both the component of the local maximum set to which the
iteration converges and the particular one of the $N!$ labelings of the
states are determined by the initial estimates of the parameters. If we wish to
average several models resulting from several different starting points 
to achieve model stability, we must be able to match the states of 
models whose states are permuted. We have devised a solution to this 
problem based on a minimum-weight bipartite matching algorithm.” 


3.1 Scaling 


The principle on which we base our scaling is to multiply $\alpha_t(i)$ by
some scaling coefficient independent of $i$ so that it remains within the
dynamic range of the computer for $1 \le t \le T$. We propose to perform
a similar operation on $\beta_t(i)$ and then, at the end of the computation,
remove the total effect of the scaling.

We illustrate the procedure for (10), the reestimation formula for
the state transition probabilities. Let $\alpha_t(i)$ be computed according to
(1) and then be multiplied by a scaling coefficient, $c_t$, where, say,

$$c_t = \left[\sum_{i=1}^{N} \alpha_t(i)\right]^{-1} \tag{50}$$

so that $\sum_{i=1}^{N} c_t\alpha_t(i) = 1$ for $1 \le t \le T$. Then, as we compute $\beta_t(i)$ from
(2), we form the product $c_t\beta_t(i)$ for $T \ge t \ge 1$ and $1 \le i \le N$. In terms
of the scaled forward and backward probabilities, the right-hand side
of (10) becomes

$$\frac{\sum_{t=1}^{T-1} C_t\alpha_t(i)\,a_{ij}\,b_j(O_{t+1})\,\beta_{t+1}(j)\,D_{t+1}}{\sum_{t=1}^{T-1}\sum_{\ell=1}^{N} C_t\alpha_t(i)\,a_{i\ell}\,b_\ell(O_{t+1})\,\beta_{t+1}(\ell)\,D_{t+1}} \tag{51}$$

where

$$C_t = \prod_{\tau=1}^{t} c_\tau \tag{52}$$

and

$$D_t = \prod_{\tau=t}^{T} c_\tau.$$

This results from the individual scale factors being multiplied together
as we perform the recursions of (1) and (2).

Now note that each summand in both the numerator and the
denominator has the coefficient $C_t D_{t+1} = \prod_{\tau=1}^{T} c_\tau$. These coefficients
can be factored out and canceled so that (51) has the correct value $\bar{a}_{ij}$
as specified by (10). The reader can verify that this technique may be
equally well applied to the reestimation formulas (11) and (12). It
should also be obvious that, in practice, the scaling operation need not
be performed at every observation time. One can use any scaling
interval for which underflow does not occur. In this case, the scale
factors corresponding to values of $t$ within any interval are set to unity.

While the above described scaling technique leaves the reestimation
formulas invariant, (3) and (4) are still useless for computing $P$.
However, $\log P$ can be recovered from the scale factors as follows.
Assume that we compute $c_t$ according to (50) for $t = 1, 2, \ldots, T$. Then

$$C_T \sum_{i=1}^{N} \alpha_T(i) = 1 \tag{53}$$

and from (53) it is obvious that $C_T = 1/P$. Thus, from (52) we have

$$\prod_{t=1}^{T} c_t = \frac{1}{P}. \tag{54}$$

The product of the individual scale factors cannot be evaluated but we
can compute

$$\log P = -\sum_{t=1}^{T} \log c_t. \tag{55}$$
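
As an illustration of (50) through (55), the following Python sketch
runs the forward recursion on scaled values and accumulates $\log P$
from the scale factors. It assumes the usual form of the forward
recursion (1), with $\alpha_1(i) = \pi_i b_i(O_1)$, and scales at every
observation. The function name and the list-of-lists representation of
$A$ and $B$ are choices made here for illustration only.

```python
import math

def scaled_forward_log_prob(pi, A, B, obs):
    """Return (log P, [c_1, ..., c_T]) using the per-step scaling of (50)."""
    N = len(pi)
    # Assumed initialization of the forward recursion: alpha_1(i) = pi_i * b_i(O_1)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    scales = []
    log_p = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Induction step, applied to the already scaled alpha values
            alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                     for j in range(N)]
        total = sum(alpha)
        if total == 0.0:
            raise ValueError("observation sequence has zero probability")
        c_t = 1.0 / total                  # scale factor, as in (50)
        alpha = [a * c_t for a in alpha]   # scaled values now sum to one
        scales.append(c_t)
        log_p -= math.log(c_t)             # running sum of (55)
    return log_p, scales
```

Because only the logarithms of the scale factors are summed, the result
stays finite for sequences long enough to make $P$ itself underflow.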


If one chooses to use the Viterbi algorithm for classification, then
$\log P$ can be computed directly from $\pi$, $A$, and $B$ without regard for
the scale factors. Initially, we let $\phi_1(i) = \log[\pi_i b_i(O_1)]$ and then modify
(7a) so that

$$\phi_t(j) = \max_{1 \le i \le N}[\phi_{t-1}(i) + \log a_{ij}] + \log[b_j(O_t)]. \tag{56}$$

In this case $\log P = \max_{1 \le i \le N}[\phi_T(i)]$.
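
The recursion (56) can be sketched in a few lines of Python. This is a
minimal illustration, not the original program: it assumes that (7a) is
the usual Viterbi recursion, represents $\log 0$ by negative infinity so
that forbidden transitions and symbols drop out of the maximization, and
uses hypothetical function names.

```python
import math

NEG_INF = float("-inf")

def _log(x):
    """Logarithm that maps zero probability to negative infinity."""
    return math.log(x) if x > 0.0 else NEG_INF

def viterbi_log_prob(pi, A, B, obs):
    """Return max_i phi_T(i), the log probability of the best state path."""
    N = len(pi)
    # phi_1(i) = log[pi_i * b_i(O_1)]
    phi = [_log(pi[i]) + _log(B[i][obs[0]]) for i in range(N)]
    for o in obs[1:]:
        # phi_t(j) = max_i [phi_{t-1}(i) + log a_ij] + log b_j(O_t), as in (56)
        phi = [max(phi[i] + _log(A[i][j]) for i in range(N)) + _log(B[j][o])
               for j in range(N)]
    return max(phi)
```

Since only sums of logarithms are formed, no scale factors are needed.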


If the parameters of the model are to be computed by means of 
classical optimization techniques, we can make the computation better 
conditioned numerically by maximizing log P rather than P. The 
scaling method of (50) makes this straightforward. 

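First note that if we are to maximize $\log P$, then we will need the
partial derivatives of $\log P$ with respect to the parameters of the
model. So, for example, we will need

$$\frac{\partial}{\partial a_{ij}}(\log P) = \frac{1}{P}\frac{\partial P}{\partial a_{ij}} = C_T\frac{\partial P}{\partial a_{ij}}. \tag{57}$$

Substituting the right-hand side of (29) for $\partial P/\partial a_{ij}$ in the right-hand
side of (57) yields

$$\begin{aligned}
\frac{\partial}{\partial a_{ij}}(\log P) &= C_T \sum_{t=1}^{T-1} \alpha_t(i)\,b_j(O_{t+1})\,\beta_{t+1}(j) \\
&= \sum_{t=1}^{T-1} C_t\alpha_t(i)\,b_j(O_{t+1})\,\beta_{t+1}(j)\,D_{t+1} \\
&= \sum_{t=1}^{T-1} \left(\prod_{\tau=1}^{t} c_\tau\right)\alpha_t(i)\,b_j(O_{t+1})\,\beta_{t+1}(j)\left(\prod_{\tau=t+1}^{T} c_\tau\right).
\end{aligned} \tag{58}$$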

So that if we evaluate (29) formally, using not the true values of the 
forward and backward probabilities but the scaled values, then we will 
have the correct value of the partial derivatives of log P with respect 
to the transition probabilities. A similar argument can be made for the 
other parameters of the model and, thus, the scaling method of (50) 
provides a means for the direct evaluation of $\nabla(\log P)$, which is
required for the classical optimization algorithms. Later we shall see 
that the combination of maximizing log P and this scaling technique 
simplifies the solution of the left-to-right Markov modeling problem 
as well. 


3.2 Finite training sets 


We now turn our attention to solving the problems created by finite- 
training-set size. As we noted earlier, the effect of this problem is that 
observation sequences generated by a putative model will have zero 
probability conditioned on the model parameters. Since the cause of 
the difficulty is the assignment of zero to some parameters, usually 
one or more symbol probabilities, it is reasonable to try to solve the
problem by constraining the parameters to be positive. 

We can maximize $P$ subject to the new constraints $a_{ij} \ge \epsilon > 0$;
$b_{jk} \ge \epsilon > 0$, most easily using the classical methods. In fact, the
algorithm described earlier based on the Kuhn-Tucker theorem is
unchanged except that the procedure for determining the active
constraints is based on $\epsilon$ rather than zero.

While the Lagrangian methods are perfectly adequate, it is also 
possible to build the new constraints into the Baum-Welch algorithm. 
We can show how this is done by making a slight modification to the 
proof of the algorithm given earlier (Section 2.1). Recall that the proof 
of the Baum-Welch algorithm was based on maximization of 2N + 1 
expressions of the type maximized in Lemma 2, eq. (14). Since these 
expressions involve disjoint sets of variables chosen from $A$, $B$, $\pi$, it
suffices to consider any one of the maximizations. In fact, it suffices to
show how Lemma 2 gets modified. Thus we wish now to maximize

$$F(x) = \sum_{i=1}^{N} c_i \ln x_i \tag{59}$$

subject to the constraints

$$\sum_{i=1}^{N} x_i = 1 \tag{60a}$$

and

$$x_i \ge \epsilon, \quad i = 1, \ldots, N. \tag{60b}$$

(From the following discussion it will be obvious that a trivial
generalization allows $\epsilon$ to depend on $i$.)

Now without the inequality constraints (60b), Lemma 2 showed that
$F(x)$ attains its unique global maximum when $x_i = c_i/\sum_{j=1}^{N} c_j$. Suppose
now that this global maximum occurs outside the region specified by
the inequality constraints (60b). Specifically, let

$$\frac{c_i}{\sum_{j=1}^{N} c_j} \ge \epsilon \quad \text{for } i = 1, \ldots, N - \ell \tag{61a}$$

$$\frac{c_i}{\sum_{j=1}^{N} c_j} < \epsilon \quad \text{for } i = N - \ell + 1, \ldots, N. \tag{61b}$$

From the concavity of $F(x)$ it follows that the maximum, subject to
the inequality constraints, must occur somewhere on the boundary
specified by the violated constraints (61b). Now it is easily shown that
if $x_i$ for some $i > N - \ell$ is replaced by $\epsilon$, then the global maximum over
the rest of the variables occurs at values lower than those given above.
From this we conclude that we must set

$$x_i = \epsilon \quad \text{for } i > N - \ell \tag{62}$$
and maximize

$$F(x) = \sum_{i=1}^{N-\ell} c_i \ln x_i \tag{63}$$

subject to the constraint $\sum_{i=1}^{N-\ell} x_i = 1 - \ell\epsilon$. But this, analogously to
Lemma 2, occurs when

$$x_i = (1 - \ell\epsilon)\frac{c_i}{\sum_{j=1}^{N-\ell} c_j}, \quad i \le N - \ell. \tag{64}$$

If these new values of $x_i$ satisfy the constraints, we are done. If one or
more become lower than $\epsilon$, they too must be set equal to $\epsilon$, and $\ell$
augmented appropriately.

Thus the modified Baum-Welch algorithm is as follows. Suppose we
wish to constrain $b_{jk} \ge \epsilon$ for $1 \le j \le N$ and $1 \le k \le M$. We first evaluate
$B$ using the reestimation formulas. Assume that some set of the
parameters in the $j$th row of $B$ violates the constraint so that $b_{jk_i} < \epsilon$
for $1 \le i \le \ell$. Then set $b_{jk_i} = \epsilon$ for $1 \le i \le \ell$ and readjust the remaining
parameters according to (64) so that

$$\tilde{b}_{jk} = (1 - \ell\epsilon)\frac{b_{jk}}{\sum_{k' \notin \{k_i\}} b_{jk'}}, \quad \forall k \notin \{k_i \mid 1 \le i \le \ell\}. \tag{65}$$


After performing the operation of (65) for each row of B, the resulting 
B is the optimal update with respect to the desired constraints. The 
method can be extended to include the state transition matrix if so 
desired. There is no advantage to treating $\pi$ in the same manner since,
for any single observation sequence, $\pi$ will always be a unit vector with
exactly one nonzero component. In any case, (65) may be applied at 
each iteration of the reestimation formulas, or once as a post-process- 
ing stage after the Baum-Welch algorithm has converged. 
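
The row update of (65), including the repeated clamping described after
(64), might be coded as in the following Python sketch. It operates on
one row of $B$ at a time; the name `floor_row` and the use of a plain
list for the row are assumptions made for illustration.

```python
def floor_row(row, eps):
    """Constrain one stochastic row so that every entry is at least eps.

    Entries below eps are set to eps and the remaining entries are
    rescaled to carry the leftover mass, as in (65). The step is repeated
    if the rescaling pushes further entries below eps.
    """
    n = len(row)
    if eps * n > 1.0:
        raise ValueError("eps is too large for a row of this length")
    clamped = set()
    while True:
        free = [k for k in range(n) if k not in clamped]
        free_mass = sum(row[k] for k in free)
        # Free entries share the mass 1 - (number clamped) * eps
        scale = (1.0 - eps * len(clamped)) / free_mass
        new_row = [eps if k in clamped else row[k] * scale for k in range(n)]
        violators = [k for k in free if new_row[k] < eps]
        if not violators:
            return new_row
        clamped.update(violators)
```

For example, `floor_row([0.0, 0.05, 0.45, 0.5], 0.1)` clamps the first
two entries and rescales the last two so that the row still sums to one.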


3.3 Combining models 


The final implementational issue that we shall consider in this 
section is that of combining models for improved stability. There are 
several circumstances under which it may be desirable to combine 
several models into one. In spectral estimation, for example, to com- 
pute a long-term average spectrum of a stationary signal, it may be 
convenient to average a number of spectra computed over shorter 
intervals. It seems quite natural to apply similar block-processing 
techniques to the Markov modeling problem if the source is assumed 
to be ergodic. We may, for example, divide a long sequence of obser- 
vations into contiguous subsequences, estimate model parameters for 
each subsequence, and combine the results. 

Whether or not the source is ergodic, we may still attempt to
increase the robustness of our model by averaging the parameter 
estimates derived from multiple initial values and/or independent 
observation sequences. 

In any case, the difficulty that will be encountered is that even if 
there are finitely many isolated local maxima of P, they are only 
unique to within a renaming of the states. For two different observation
sequences, $q_i$ and $q_j$ may be topologically equivalent, but $i \ne j$. We
might try to avoid this problem by using the final parameter values for 
one observation sequence as the initial values for the next in hopes 
that this will restrict the search to a neighborhood of a single local 
maximum. This method, unfortunately, is not reliable. A better ap- 
proach is that of finding a renaming of the states that minimizes, in 
some sense, the difference between two models. 

Suppose $b_j$ and $b'_j$, $1 \le j \le N$, are, respectively, the rows of two
estimates of $B$. Let $p(j)$ be a permutation of the state index, $j$, and let
$d(\cdot,\cdot)$ be some distance metric. Then we seek the permutation, $p$, of
$(q_1, q_2, \ldots, q_N)$ such that

$$D = \sum_{j=1}^{N} d[b_j, b'_{p(j)}] \tag{66}$$

is minimized. The naive solution is to try all possible $N!$ permutations
and select the best one in the sense of (66). However, for $N > 10$ the
computation becomes intractable. The problem can be brought within
reach, however, by transforming it into a minimum-weight
bipartite-graph-matching problem on $2N$ vertices. In the literature on
combinatorial optimization (see, e.g., Ref. 28), several algorithms are
available for accomplishing such a match in a number of operations that
grows as $N^3$. In Appendix B, we describe one such algorithm based on
an outline provided to us by R. E. Tarjan. 
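
For small $N$ the objective (66) can be checked directly with the naive
exhaustive search mentioned above. The following Python sketch is that
naive search, not the bipartite matching algorithm of Appendix B. The
Euclidean distance between rows is an assumed choice for $d(\cdot,\cdot)$,
and the function names are hypothetical.

```python
import itertools
import math

def euclidean(u, v):
    """Euclidean distance between two rows (one possible choice of d)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def best_state_permutation(B1, B2, dist=euclidean):
    """Exhaustively minimize D = sum_j dist(B1[j], B2[p[j]]) over all p."""
    n = len(B1)
    best_p, best_d = None, float("inf")
    for p in itertools.permutations(range(n)):
        d = sum(dist(B1[j], B2[p[j]]) for j in range(n))
        if d < best_d:
            best_p, best_d = p, d
    return best_p, best_d
```

The loop visits all $N!$ permutations, so it is useful only as a
reference against which a matching routine can be tested.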


IV. LEFT-TO-RIGHT HIDDEN MARKOV MODELS 


For the purposes of isolated word recognition, it is useful to consider 
a special class of absorbing Markov chains that leads to what we call 
left-to-right models. These models have the following properties: 
(i) The first observation is produced while the Markov chain is in
a distinguished state called the starting state, designated $q_1$.
(ii) The last observation is generated while the Markov chain is in
a distinguished state called the final or absorbing state, designated $q_N$.
(iii) Once the Markov chain leaves a state, that state cannot be
revisited at a later time.
The simplest form of a left-to-right model is shown in Fig. 2, from 
which the origin of the term left-to-right becomes clear. 
In this section we shall consider two problems associated with these 

[Figure 2: chain of four states with self-transition probabilities $a_{11}$, $a_{22}$, $a_{33}$, $a_{44} = 1$ and symbol probabilities $b_{1k}$, $b_{2k}$, $b_{3k}$, $b_{4k}$.]

Fig. 2—The simplest form of left-to-right model. 


special hidden Markov models. Note that a single, long observation
sequence is useless for training such models, because once the state $q_N$
is reached, the rest of the sequence provides no further information
about earlier states. The appropriate training data for such a model
are a set of observation sequences obtained by several starts in state
$q_1$. In the case of isolated word recognition, for instance, several
independent utterances of the same word provide such a set. We wish,
therefore, to modify the training algorithm to handle such training
data. We also wish to compute the probability that a single given
observation sequence, $O_1, O_2, \ldots, O_T$, was produced by the model,
with the assumption that $O_1$ was produced in state $q_1$ and $O_T$ in state
$q_N$. The three conditions mentioned above can be satisfied as follows:

Condition (i) will be satisfied if we set $\pi = (1, 0, \ldots, 0)$ and do not
reestimate it. Condition (ii) can be imposed by setting

$$\beta_T(j) = \begin{cases} 1 & \text{for } j = N \\ 0 & \text{otherwise.} \end{cases} \tag{67}$$

Condition (iii) can be guaranteed in the Baum-Welch algorithm by
initially setting $a_{ij} = 0$ for $j < i$ (and in fact for any other combination
of indices that specify transitions to be disallowed). It is clear from
(34) that any parameter once set to zero will remain zero. For the
gradient methods the appropriate $a_{ij}$'s are just set to zero and only the
remaining parameters are adjusted.

The modification of the training procedure is as follows: Let us
denote by $O = [O^{(1)}, O^{(2)}, \ldots, O^{(K)}]$ the set of observation sequences,
where $O^{(k)} = O_1^{(k)} O_2^{(k)} \cdots O_{T_k}^{(k)}$ is the $k$th sequence. We treat the
observation sequences as independent of each other and then we
adjust the parameters of the model $M$ to maximize

$$P = \prod_{k=1}^{K} \mathrm{Prob}(O^{(k)} \mid M) = \prod_{k=1}^{K} P_k. \tag{68}$$

Since the Baum-Welch algorithm computes the frequency of
occurrence of various events, all we need to do is to compute these
frequencies of occurrence in each sequence separately and add them together.
Thus the new reestimation formulas may be written as

$$\bar{a}_{ij} = \frac{\sum_{k=1}^{K}\sum_{t=1}^{T_k-1} \alpha_t^k(i)\,a_{ij}\,b_j(O_{t+1}^{(k)})\,\beta_{t+1}^k(j)}{\sum_{k=1}^{K}\sum_{t=1}^{T_k-1} \alpha_t^k(i)\,\beta_t^k(i)} \tag{69}$$

and

$$\bar{b}_{ij} = \frac{\sum_{k=1}^{K}\sum_{t \ni O_t^{(k)} = v_j} \alpha_t^k(i)\,\beta_t^k(i)}{\sum_{k=1}^{K}\sum_{t=1}^{T_k} \alpha_t^k(i)\,\beta_t^k(i)}. \tag{70}$$

As noted above, $\pi$ is not reestimated.

Scaling these computations requires some care since the scale factors 
for each individual set of forward and backward probabilities will be 
different. One way of circumventing the problem is to remove the scale 
factors from each summand before adding. We can accomplish this by 
returning the 1/P factor [which appears in (8) and (9) and was 
cancelled to obtain (10)] to the reestimation formula. Using the rees- 
timation formula for the transition probabilities as an example, (69) 
becomes 

$$\bar{a}_{ij} = \frac{\sum_{k=1}^{K} \frac{1}{P_k}\sum_{t=1}^{T_k-1} \alpha_t^k(i)\,a_{ij}\,b_j(O_{t+1}^{(k)})\,\beta_{t+1}^k(j)}{\sum_{k=1}^{K} \frac{1}{P_k}\sum_{t=1}^{T_k-1} \alpha_t^k(i)\,\beta_t^k(i)}. \tag{71}$$

If the right-hand side of (71) is evaluated using the scaled values of the
forward and backward probabilities, then each term in the inner
summation will be scaled by $C_t^k D_{t+1}^k$, which will then be cancelled by
the same factor which multiplies $P_k$. Thus, using the scaled values in
computing (69) results in an unscaled $\bar{a}_{ij}$. The procedure is easily
extended to computation of the symbol probabilities. Also note that 
for the purposes of classification only one subsequence is to be consid- 
ered so that either (55) or (56) may be used unaltered to compute P. 

To apply Lagrangian techniques to left-to-right models we note that 
upon taking logarithms of (68) we have 


$$\log P = \sum_{k=1}^{K} \log P_k. \tag{72}$$


The derivatives needed to maximize log P in (72) can be obtained by 
evaluating expressions for the derivatives of each individual
subsequence and summing. For example, for $a_{ij}$ we have [cf. (57) and (58)]

$$\frac{\partial}{\partial a_{ij}}(\log P) = \sum_{k=1}^{K} \frac{\partial}{\partial a_{ij}}(\log P_k) = \sum_{k=1}^{K} C_{T_k}^k \sum_{t=1}^{T_k-1} \alpha_t^k(i)\,b_j(O_{t+1}^{(k)})\,\beta_{t+1}^k(j). \tag{73}$$


As in all previous cases an analogous formula may be derived for the 
other parameters. 

In practice, A and B for left-to-right models are especially sparse. 
Some of the zero values are so by design but others are dependent on 
O. Parameters of this type will be found one at a time by standard line 
search strategies. We have found that the convergence of the Lagran- 
gian techniques can be substantially accelerated by taking large enough 
steps so that several positivity constraints become binding. The cor- 
responding variables are then clamped and (65) is applied before 
beginning the next iteration. 


V. NUMERICAL EXAMPLES 


In this section, we give some instructive examples of the behavior of 
several of the algorithms discussed above. The algorithms were all 
coded in FORTRAN 77 on a Data General MV-8000, which uses a 32- 
bit floating point word. The data used in the tests came from either a 
Monte Carlo simulation of a hidden Markov chain or from a portion 
of a newspaper text that was edited to include only the 26 characters 
of the English alphabet and a special character denoting an interword 
space. The simulations have the valuable property that the model is 
known a priori, so that simple models may be used for checking 
program correctness while the more complicated ones can elicit some 
subtle and important numerical and methodological characteristics of 
the algorithms. 

In our experiments we used the following procedure to generate 
observation sequences by means of a random number generator whose 
output is uniform on [0, 1] and specified values of $\pi$, $A$, $B$, $T$, and $V$:

(i) Partition the unit interval proportionally to the components of
$\pi$. Generate a random number and select a start state, $q_i$, according to
the subinterval in which the number falls. Set $t = 1$.

(ii) Partition the unit interval proportionally to the components of
the $i$th row of $B$. Generate a random number and select a symbol, $v_k$,
according to the subinterval in which the number falls. Set $O_t = v_k$.

(iii) Partition the unit interval proportionally to the components of
the $i$th row of $A$. Generate a random number and select the next state,
$q_j$, according to the subinterval in which the number falls.

(iv) Increment $t$. If $t \le T$ set $q_i = q_j$ and repeat (ii) through (iv);
otherwise stop.
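
Steps (i) through (iv) translate almost line for line into code. The
Python sketch below is an illustration only (the original programs were
written in FORTRAN 77); the function names, the zero-based state and
symbol indices, and the use of Python's `random.Random` as the uniform
generator are assumptions.

```python
import random

def pick(probs, u):
    """Return the index of the subinterval of [0, 1] that contains u."""
    cumulative = 0.0
    last_positive = None
    for index, p in enumerate(probs):
        if p > 0.0:
            cumulative += p
            last_positive = index
            if u < cumulative:
                return index
    # Reached only through rounding when u is very close to 1
    return last_positive

def generate(pi, A, B, T, rng=None):
    """Generate T observations and the hidden states that produced them."""
    if rng is None:
        rng = random.Random()
    state = pick(pi, rng.random())                         # step (i)
    observations, states = [], []
    for _ in range(T):                                     # step (iv) loop
        states.append(state)
        observations.append(pick(B[state], rng.random()))  # step (ii)
        # Step (iii); the draw after the last observation is simply unused
        state = pick(A[state], rng.random())
    return observations, states
```

Returning the hidden state sequence along with the observations is a
convenience for checking programs; a training algorithm would see only
the observations.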

Using this observation generator, several two- and three-state Mar- 
kov models were simulated. These simulations were used to verify that 
the parameter estimation algorithms were working correctly and to
study the effects of the scaling interval on the accuracy of the algo- 
rithms. In this study we found that all scaling intervals that were 
sufficiently short to prevent underflow yielded numerically identical 
results. Thus one can, at one extreme, scale the forward and backward 
probabilities after each observation or, at the other, wait until a 
threshold signaling that underflow is imminent is exceeded and only 
then perform the scaling operation. 

We next proceeded to study a pair of four-state (N = 4), four-symbol 
(M = 4) models shown below and referred to as SRC44 and SRC45. 


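SRC44:

$$A = \begin{bmatrix} 0 & 0 & 0.5 & 0.5 \\ 0.5 & 0 & 0 & 0.5 \\ 0.5 & 0.5 & 0 & 0 \\ 0 & 0.5 & 0.5 & 0 \end{bmatrix}, \quad
B = \begin{bmatrix} 0.5 & 0.5 & 0 & 0 \\ 0 & 0.5 & 0.5 & 0 \\ 0 & 0 & 0.5 & 0.5 \\ 0.5 & 0 & 0 & 0.5 \end{bmatrix}$$

$$\pi = [0.25 \quad 0.25 \quad 0.25 \quad 0.25]$$

and

SRC45:

$$A = \begin{bmatrix} 0 & 0 & 0.25 & 0.75 \\ 0.15 & 0 & 0 & 0.85 \\ 0.2 & 0.8 & 0 & 0 \\ 0 & 0.22 & 0.78 & 0 \end{bmatrix}, \quad
B = \begin{bmatrix} 0.25 & 0.75 & 0 & 0 \\ 0 & 0.15 & 0.85 & 0 \\ 0 & 0 & 0.1 & 0.9 \\ 0.2 & 0 & 0 & 0.8 \end{bmatrix}$$

$$\pi = [0.25 \quad 0.25 \quad 0.25 \quad 0.25].$$
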


The state transition diagrams for these models are shown in Fig. 3. 

Model SRC44 is a balanced model in the sense that all permissible 
transitions and symbols are equally likely, whereas model SRC45 is
skewed in that it distinctly favors some transitions and symbols over 
others. 





Fig. 3—The four-state model used for testing. 

For each of these sources we processed observation sequences
ranging in length from 100 to 4000 with the Baum-Welch algorithm. Initial
values for $A$, $B$, and $\pi$ were chosen at random and the algorithm was
terminated in one of two ways: either when the change in $\log P$ from
one iteration to the next fell below an arbitrary threshold, or when the
number of iterations exceeded a specified maximum value. The
maximum number of iterations was varied from 100 to 1000. For each
estimate, $\hat{B}$, of the source matrix $B$, a measure of estimation error

$$\|\hat{B} - B\| = \left[\frac{1}{NM}\sum_{j=1}^{N}\sum_{k=1}^{M}\left(\hat{b}_{p(j)k} - b_{jk}\right)^2\right]^{1/2} \tag{74}$$

was computed, where $p(j)$ is the state permutation that minimizes the


[Figure 4: four panels plotting $\|\hat{B} - B\|$ in decibels against the number of observations, 100 to 4000 in panels (a) through (c) and 100 to 1000 in panel (d); panel (d) shows maximum and average curves.]


Fig. 4—Estimation error as a function of number of observations for source SRC44 
for: (a) 100 iterations maximum, (b) 200 iterations, (c) 400 iterations, and (d) 10 random 
initial starts with 200 iterations maximum. 
estimation error. The technique of minimum bipartite matching (see
Appendix B) was used to determine the optimum state permutation. 
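
For a given state permutation, the error measure (74) is a root mean
square difference over all entries of the two matrices. A minimal Python
sketch follows; the function name and the representation of the
permutation as a tuple `p`, with `p[j]` the row of the estimate matched
to source state `j`, are assumptions for illustration.

```python
import math

def estimation_error(B_hat, B, p):
    """Root mean square difference between B and the permuted estimate, as in (74)."""
    n, m = len(B), len(B[0])
    total = sum((B_hat[p[j]][k] - B[j][k]) ** 2
                for j in range(n) for k in range(m))
    return math.sqrt(total / (n * m))
```

The permutation itself would come from the matching step, so this
function only evaluates the error once the states have been aligned.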

Plots of the quantity $\|\hat{B} - B\|$ (on a log scale) versus $T$, the number
of observations, are given in Fig. 4 for source SRC44. Separate results
are shown for 100 iterations maximum of the BW reestimation
procedure (part a), 200 iterations (part b), and 400 iterations (part c).
Figure 4d shows the results of using 10 random initial starts with
200 iterations maximum: the maximum and minimum estimation errors
for each $T$ and the estimation error for the average of all 10 $\hat{B}$
matrices. (The reader should note that $T$ goes to
4000 in parts a through c, but only to 1000 in part d of Fig. 4.)

The curves given in Fig. 4 show several very interesting properties 
of the reestimation procedure. First we see that as the number of 
observations increases, a slow decrease in the average estimation error 
is obtained. However, we can see that statistical fluctuations (owing to 
different initial guesses for the model parameters) are often of larger 
magnitude than the slowly decreasing components of the curve. As the 
maximum number of iterations increases, the magnitude of the statis- 
tical fluctuations decreases, especially for larger values of T. 

The curves of Fig. 4d show that although there is a wide range in 
the value of the estimation error for multiple starting choices, aver- 
aging the B matrices (after appropriate state alignment) leads to 
estimation errors comparable with the best single estimates. 

Figure 5 shows a similar set of results for the Markov source SRC45. 
Figure 5a shows a curve of estimation error versus number of obser- 
vations for a maximum of 400 iterations, Fig. 5b shows the same curve 
with a maximum of 1000 iterations, Fig. 5c shows the curve when the 
initial estimates of both A and B are set to the source values exactly, 
and Fig. 5d shows maximum and minimum estimation errors for 10 
random starting points. 

Although the general trends of the data in Fig. 5 are similar to those 
of Fig. 4, there are several key differences. From Fig. 5b it can be seen 
that even for 1000 iterations, the variation in model estimates is 
enormous (36-dB variations). This result suggests that it is significantly 
more difficult to estimate parameters of a skewed Markov model than 
those of a fairly uniform model. The curves of Fig. 5c, in which the 
initial conditions were set to the source generator values, show that 
extremely good solutions could be obtained if the reestimation proce- 
dure could start in the neighborhood of the “exact” solution. Obviously 
this situation (i.e., starting near the correct parameter values) is not 
enforceable for real data. 

The curves of Fig. 5d, in which multiple estimates of the Markov 
model are averaged, show that averaging the individual parameter 
estimates does not lead to a low error estimate for SRC45. This is 
Fig. 5—Estimation error as a function of number of observations for source SRC45 
for: (a) 400 iterations maximum, (b) 1000 iterations maximum, (c) initial estimates of A
and B set to source values, and (d) maximum and minimum estimation errors for 10
random starting points.


undoubtedly because of the parameter estimates with high errors that 
occur and which have an undue influence on the average. 


5.1 Left-to-right Markov source estimation 


The second series of experiments dealt with the left-to-right Markov 
models, as would be appropriate for our intended application to iso- 
lated word recognition. Figure 6 shows four such models. For each of 
these models (denoted as SRC195, SRC295, SRC395, and SRC495 in 
Fig. 6), the specifications were: 


$$N = 5, \quad M = 9, \quad \pi = (1, 0, 0, 0, 0)$$

[Figure 6: state diagrams of the four left-to-right models; the values 0.8, 0.9, 0.8, 0.9, 1.0 survive from the SRC195 panel.]


Fig. 6—Left-to-right models used for testing for: (a) SRC195, (b) SRC295, (c) SRC395, 
and (d) SRC495. 


and 
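$$B = \begin{bmatrix}
0.7 & 0.3 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0.8 & 0.2 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0.2 & 0.8 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0.3 & 0.7
\end{bmatrix}$$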


The state transition probabilities were those shown in Fig. 6. The 
SRC195 model is a left-to-right model. The SRC295 model allows a 
transition between states 1 and 3 and states 3 and 5, as well as 
transitions between sequentially numbered states. Both the SRC395 
and SRC495 models include states whose self-transition probabilities 
($a_{ii}$) are very small. We will see below that the average occupancy of
such states is only 1 to 2 observations. For these non-ergodic models
the concept of occupancy of the transient (i.e., non-absorbing) states
is important.

If we denote the probability of a transition from a state to itself as 
p, then the probability of a transition out of that state at time T + 1 
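(assuming the state was entered at time $t = 0$) is

$$\mathrm{Prob}(q_i \text{ at } t = 0, 1, 2, \ldots, T \text{ and } q_j \ne q_i \text{ at } T + 1) = p^T(1 - p)$$

for models of the form of SRC195. Hence the average occupancy of a
state is given by

$$d = \sum_{t=0}^{\infty} (t + 1)\,p^t(1 - p) = \frac{1}{1 - p}. \tag{75}$$
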

For p = 0.9 we get d = 10, for p = 0.8 we get d = 5, for p = 0.5 we get 
d = 2, and for p = 0.1 we get d = 1.1. Standard formulas are available 
for computing average state occupancies for arbitrary transition ma- 
trices. We will not consider them here; however, it is clear that states 
2 and 4 in models SRC395 and SRC495 are of low occupancy. 
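
The closed form in (75) is easily checked against a truncated version
of the series. The short Python sketch below does so; the function
names and the truncation at 2000 terms are arbitrary choices made for
illustration.

```python
def mean_occupancy(p):
    """Average occupancy of a state with self-transition probability p."""
    return 1.0 / (1.0 - p)

def mean_occupancy_series(p, terms=2000):
    """Truncated sum of (t + 1) * p**t * (1 - p), the series in (75)."""
    return sum((t + 1) * p ** t * (1.0 - p) for t in range(terms))
```

For the values of $p$ quoted above, the two functions agree to within
rounding.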

To test the reestimation procedure on the Markov sources of Fig. 6, 
a set of $K$ sequences was generated for each model, with $K$ taking the
values 10, 25, 50, and 100. Each sequence was generated using the Markov
sequence generator described earlier, modified slightly to ensure that 
each sequence terminated in state 5 (the final state) and stayed there 
for 5 observations. The sequences were, however, of variable duration, 
depending on the exact sequence of state transitions that occurred. 

The results showed that for sequences generated from model 
SRC195, the correct model parameters (to within small estimation 
errors) were obtained for all values of $K$ from 10 to 100. For sequences
generated from model SRC295 only the training set of 10 observation
sequences yielded grossly incorrect model parameter estimates. All other
training sets ($K$ = 25, 50, 100) yielded the correct parameter values.

For both source models SRC395 and SRC495, however, no com- 
pletely correct parameter estimates were obtained. In particular, the 
states whose expected occupancy was small (i.e., states 2 and 4 in both 
models) were merged with either the preceding or the following state 
(or both), while other states whose expected occupancy was larger 
were often split into 2 states, each with the same set of output symbols. 
These experiments indicate that it is difficult to reliably estimate 
parameters of a state, in a left-to-right model, whose average occu- 
pancy is very much smaller than that of the states to which it is 
connected. 


5.2 Tests on a non-Markov source


In the experiments described above, the observations were in fact 
generated by a known probabilistic function of a Markov chain. A 
more difficult test of these techniques was performed in which the 
observations were 5000 characters from a newspaper article edited to 
contain only the letters of the alphabet and spaces. We used the 
observation sequence to train a four-state, 27-symbol model. Of course, 
the “true” underlying model is not known, as it was in the tests 
previously described. Nor is it likely that the four-state model is 
complex enough to model the richness of structure of written English. 
Even if it were, it is unlikely that 5000 characters is sufficient to 
capture the structure. Unfortunately, these are exactly the limitations 
with which the experimenter will be faced in trying to model “real” 
processes. Our hope was that the text analysis problem would reveal 
some of the ambiguities that will be encountered in making hidden 
Markov models of natural phenomena. 

The text was first analyzed using the Baum-Welch reestimation 
formulas with a randomly chosen starting point. The algorithm
converged in 310 iterations with $\log P = -1.317 \times 10^4$. For purposes of
comparison, we analyzed the same data with a quasi-Newton optimi- 
zation routine, VEO1A, from the Harwell Library.”® It required 125 
iterations to obtain a maximum value of $\log P$ of $-1.356 \times 10^4$. In this
case some care was required with the parameter values. The finite 
precision arithmetic occasionally results in a parameter value of $-10^{-7}$,
which appears to satisfy the positivity constraints. Such a value is fatal 
to the computation of log P since it will result in an attempt to take a 
logarithm of a negative quantity. Fortunately, this condition is readily 
detected in the scaling routine and corrected by setting the offending 
parameter to zero. 

Finally, we applied the Lagrangian technique described earlier to 
the same observation sequence but with a different set of initial 
parameter values. After 136 iterations, a still different model with
$\log P = -1.827 \times 10^4$ was obtained. These results illustrate some
important features of hidden Markov modeling. The computational 
methods used to obtain the models are roughly equivalent. All of the 
resulting models capture some of the structure of the data being 
analyzed. There are many different possible models with very little 
evidence for selecting the “best” one. For even very simple models, 
the likelihood function is too complicated to attribute the selection of 
one model or another by one algorithm or another to its properties. 
Finally, we note that all of the algorithms tested make large improve- 
ments to P during the early iterations and only slight incremental 
improvements later. In fact, the last half of the iterations provides no 
significant change to the model. We have used a convergence criterion 
of $10^{-7}$ on the relative increment between iterations. This may be
relaxed substantially, resulting in many fewer iterations with no at- 
tendant degradation in the model. 


VI. CONCLUSION 


We have presented some of the salient theoretical and practical 
issues associated with modeling data by probabilistic functions of a 
Markov chain. In our presentation we have concentrated on three 
issues: alternatives to the Baum-Welch reestimation algorithm, critical 
facets of implementation, and behavior of Markov models on certain 
artificial but realistic signals. 

We have observed that, while most of the discussion of parameter 
estimation for Markov models in the open literature is devoted to the 
Baum-Welch algorithm, classical optimization techniques are not only 
a viable alternative but may even be preferable in some cases. In 
particular, classical techniques are virtually unrestricted by the forms 
of either the likelihood function or the constraints. The reestimation 
formulas may be growth transformations for a wider class of functions 
and constraints than has heretofore been proven; however, it is not 
likely that a universal reestimation formula exists. For applications to 
continuous density functions (c.f. Liporace*’), the classical techniques 
may have still other advantages. 

The open literature has provided only a perfunctory, if any, discus- 
sion of some crucial numerical and implementational problems asso- 
ciated with Markov modeling. We have given details of methods of 
dealing with floating point loss of significance, finite-training-set size, 
and model stability. Wherever possible we have made our techniques 
formal and algorithmic. 

Finally, we have given several examples of the behavior of Markov 
modeling techniques on some reasonably realistic data. The most 
important lesson that can be drawn from these experiments is that 
even under ideal conditions (i.e., when the data are associated with a 
known hidden Markov process) and all the more so under realistic 
conditions, the computed models may contain artifacts and may not 
faithfully represent the inherent structure of the data. Thus, great 
caution and empirical validation are required in using these techniques. 

Despite this caveat, hidden Markov models may be beneficial in 
studying many diverse problems. In our companion paper”! we recount 
a successful application of this body of theory to a problem in
automatic speech recognition.


APPENDIX A

Several of the formulas derived in the text are much more compact
in matrix notation. Let $'$ denote matrix transposition, as usual, and
let the column vectors $\pi$ and $\mathbf{1}$, and the matrices $A$,
$B_t$, $t = 1, \ldots, T$, be defined as in the text. Also let
$\alpha_t$ and $\beta_t$ be column vectors with components
$\alpha_t(i)$, $i = 1, \ldots, N$, and $\beta_t(i)$, $i = 1, \ldots, N$,
respectively. Then the recursion for $\alpha_t$ is

$$\alpha_{t+1} = B_{t+1} A' \alpha_t, \qquad t = 1, \ldots, T-1. \tag{76}$$

The recursion for $\beta_t$ is

$$\beta_t = A B_{t+1} \beta_{t+1}, \qquad t = T-1, \ldots, 1. \tag{77}$$

The starting values are

$$\alpha_1 = B_1 \pi, \qquad \beta_T = \mathbf{1}. \tag{78}$$

The probability P is given by

$$P = \beta_t' \alpha_t \quad \text{for any } t \text{ in } (1, T). \tag{79}$$

The special cases t = 1 and t = T give

$$P = \pi' B_1 \beta_1 \tag{80}$$

and

$$P = \mathbf{1}' \alpha_T = \mathbf{1}' B_T A' B_{T-1} \cdots A' B_1 \pi. \tag{81}$$

In each of these formulas P can be regarded as the trace of a 1 × 1
matrix, which [as expanded in (81)] is a product of several matrices.
The fact that the trace of a product of matrices is invariant to a cyclic
permutation of the matrices can be used to advantage in finding the
gradient of P. Define $\nabla_A P$ as the matrix whose ijth component is
$\partial P / \partial a_{ij}$. Similarly, define $\nabla_B P$ and
$\nabla_\pi P$. Then it is straightforward to show that

$$\nabla_\pi P = B_1 \beta_1,$$

$$\nabla_A P = \sum_{t=1}^{T-1} \alpha_t \beta_{t+1}' B_{t+1},$$

$$(\nabla_B P)_{jk} = \sum_{t \,\ni\, O_t = v_k} (A' \alpha_{t-1})_j (\beta_t)_j. \tag{82}$$

In the last equation, if $O_1 = v_k$ then the corresponding term in the
sum is just $\pi_j (\beta_1)_j$.
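
The matrix recursions and the gradient formulas can be checked numerically. The following is a minimal NumPy sketch, not the authors' program. It assumes that $B_t$ is the diagonal matrix with entries $b_j(O_t)$, so that multiplying by it is an elementwise product, and that observations are 0-based symbol indices. The function names `forward_backward` and `gradients` and the small random model are hypothetical. The gradients treat the entries of A, B, and $\pi$ as free variables and ignore the stochastic constraints.

```python
import numpy as np

def forward_backward(A, B, pi, obs):
    """Matrix-form recursions (76)-(78).

    A is N x N, B is N x M with B[j, k] = b_j(k), pi has length N, and
    obs is a sequence of symbol indices in 0..M-1. Row t of alpha (beta)
    holds the vector written alpha_{t+1} (beta_{t+1}) in the text.
    """
    T, N = len(obs), A.shape[0]
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))                       # beta_T = 1
    alpha[0] = B[:, obs[0]] * pi                 # alpha_1 = B_1 pi
    for t in range(T - 1):
        # alpha_{t+1} = B_{t+1} A' alpha_t (B_{t+1} assumed diagonal)
        alpha[t + 1] = B[:, obs[t + 1]] * (A.T @ alpha[t])
    for t in range(T - 2, -1, -1):
        # beta_t = A B_{t+1} beta_{t+1}
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return alpha, beta

def gradients(A, B, pi, obs):
    """Gradients of P as in eq. (82), with all parameters treated as free."""
    alpha, beta = forward_backward(A, B, pi, obs)
    grad_pi = B[:, obs[0]] * beta[0]
    grad_A = np.zeros_like(A)
    for t in range(len(obs) - 1):
        grad_A += np.outer(alpha[t], B[:, obs[t + 1]] * beta[t + 1])
    grad_B = np.zeros_like(B)
    for t, symbol in enumerate(obs):
        # For the first observation, A' alpha_{t-1} is replaced by pi.
        predicted = pi if t == 0 else A.T @ alpha[t - 1]
        grad_B[:, symbol] += predicted * beta[t]
    return grad_pi, grad_A, grad_B

# Hypothetical 3-state, 4-symbol model used only to exercise the formulas.
rng = np.random.default_rng(0)
A = rng.random((3, 3)); A /= A.sum(axis=1, keepdims=True)
B = rng.random((3, 4)); B /= B.sum(axis=1, keepdims=True)
pi = np.array([0.5, 0.3, 0.2])
obs = [0, 3, 1, 1, 2]
alpha, beta = forward_backward(A, B, pi, obs)
P_values = (alpha * beta).sum(axis=1)            # beta_t' alpha_t for every t
```

Under these assumptions every entry of `P_values` should agree, which is the content of eq. (79).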


APPENDIX B 


As we mentioned in Section 3.3, it is frequently necessary to compare
(or average) two different estimates $B$ and $\hat{B}$ of the symbol
matrix. The optimization procedures, in general, relabel the states;
therefore the rows of $\hat{B}$ may, in general, be permuted relative to
those of $B$. Before comparison, therefore, the optimum permutation must
be found. This is defined as one that minimizes the distance D defined
in eq. (66). The problem of finding this permutation can be converted to
a network optimization problem called “bipartite weighted matching.”
To this end define $w_{ij}$ as the distance of the ith row of $B$ from
the jth row of $\hat{B}$. As we have done in Fig. 7, draw two sets P and
Q of N vertices each. For $1 \le i, j \le N$, draw an edge from the ith
vertex in P to the jth vertex in Q, and label this edge with the weight
$w_{ij}$. The resulting graph is a complete weighted bipartite graph,
and the problem is to find an N-match (i.e., a one-to-one matching with
N edges) of minimum weight.

Suppose that $Z_k$ is a k-match ($k \le N$). With respect to this match
make the following definitions:


(i) A matched edge is an edge in $Z_k$.

(ii) A free vertex is one that is not on a matched edge.

(iii) An alternating path is a path along edges that alternately
belong to $Z_k$ and do not belong to $Z_k$. [N.B. the number of matched
edges, m, on an alternating path may have any value $m \le k$. The
degenerate case m = 0 is also valid.]

(iv) An augmenting path is an alternating path between two free
vertices. Again note that a single edge connecting two free vertices is
a valid augmenting path.

Fig. 7. Complete bipartite weighted graph (N = 4) and a 2-match. Path
shown in heavy lines is an alternating path for the 2-match. It also
happens to be an augmenting path.

An augmenting path has the structure

$$p_1\, U\, q_1\, M\, p_2\, U\, q_2 \cdots U\, q_n. \tag{83}$$

Here U represents an unmatched edge, M represents a matched edge,
and $p_1$ and $q_n$ are the only free vertices on the path. (Here $p_i$
is the ith P-vertex along the path. The $q_i$ are similarly defined.)
Note that the number of U’s on an augmenting path is exactly one more
than the number of M’s. Hence the total number of edges in an
augmenting path is always odd.

Suppose we are given a k-match $Z_k$ and an augmenting path ap of
length $2l + 1$. Then we can obtain a (k + 1)-match $Z_{k+1}$ by a
complementary labeling of the edges of ap (i.e., by changing every U
to M and vice versa). If $w_1, w_2, \ldots, w_{2l+1}$ are the weights
along ap, then the weight of $Z_{k+1}$ exceeds that of $Z_k$ by the
amount $w_1 - w_2 + w_3 - \cdots + w_{2l+1}$.

This method of obtaining a (k + 1)-match from a k-match has the
following key property (proof given below): Suppose $M_k$ is a
minimum-weight k-match. Let $ap_m$ be an augmenting path for $M_k$ with
minimum incremental weight. Then the match $M_{k+1}$ obtained from $M_k$
and $ap_m$ is a minimum-weight (k + 1)-match.

Assuming this property for the moment, a minimum-weight N-match can be
determined by the following N-step algorithm:

For k = 1, 2, ..., N generate $M_k$ by finding an optimum augmenting
path for $M_{k-1}$. (Note that $M_0$ is an empty set.)

We now show that finding an optimum augmenting path is equivalent to a
shortest path problem. With reference to Fig. 8 let us generate a
directed graph by directing each edge in $M_{k-1}$ to the left and each
unmatched edge to the right. Also let us multiply the weights of all
matched edges by -1. Finally, add two vertices labeled s and t.
Connect s to all free vertices in P by right-going edges of zero weight.
Similarly, connect all free vertices of Q to t by right-going edges of
zero weight. In this directed graph any path from s to t is an
augmenting path for the matching $M_{k-1}$. Hence, if we interpret the
weights as lengths, it is clear that an optimum augmenting path is a
shortest path from s to t.

Fig. 8. Directed graph obtained from Fig. 7. Every path from s to t is
an augmenting path of the 2-match (except for the dummy edges of 0
weight from s and to t). Interpreting weights as lengths, the length of
a path from s to t is the incremental weight of the corresponding
augmenting path.
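
The N-step procedure and the directed-graph construction can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the subroutine the authors wrote: the function name `min_weight_matching` is hypothetical, vertices are numbered from 0, and the shortest path is found by Bellman-Ford relaxation, which tolerates the negative weights on matched edges directly and is therefore simpler, though slower, than a method that requires nonnegative lengths. The vertices s and t are not stored explicitly; their zero-weight edges appear as the initial distances of free P-vertices and as the choice of the nearest free Q-vertex.

```python
def min_weight_matching(w):
    """Minimum-weight N-match of a complete bipartite graph.

    w[i][j] is the weight of the edge from P-vertex i to Q-vertex j.
    Returns (match_p, total), where match_p[i] is the Q-vertex paired
    with P-vertex i.
    """
    n = len(w)
    INF = float("inf")
    TOL = 1e-12                      # guards against round-off in comparisons
    match_p = [None] * n             # match_p[i] = j if edge (i, j) is matched
    match_q = [None] * n             # match_q[j] = i if edge (i, j) is matched
    for _ in range(n):               # step k: augment M_{k-1} to M_k
        # Distances from s; free P-vertices are joined to s by zero-weight edges.
        dist_p = [0.0 if match_p[i] is None else INF for i in range(n)]
        dist_q = [INF] * n
        prev_q = [None] * n          # P-vertex preceding each Q-vertex on its path
        for _ in range(2 * n):       # Bellman-Ford passes
            changed = False
            for i in range(n):       # unmatched edges run from P to Q, weight +w
                if dist_p[i] == INF:
                    continue
                for j in range(n):
                    if match_p[i] != j and dist_p[i] + w[i][j] < dist_q[j] - TOL:
                        dist_q[j] = dist_p[i] + w[i][j]
                        prev_q[j] = i
                        changed = True
            for j in range(n):       # matched edges run from Q to P, weight -w
                i = match_q[j]
                if i is not None and dist_q[j] - w[i][j] < dist_p[i] - TOL:
                    dist_p[i] = dist_q[j] - w[i][j]
                    changed = True
            if not changed:
                break
        # Free Q-vertices join t by zero-weight edges, so the nearest one ends the path.
        j = min((q for q in range(n) if match_q[q] is None), key=lambda q: dist_q[q])
        while j is not None:         # complementary labeling along the path
            i = prev_q[j]
            previous_partner = match_p[i]
            match_p[i], match_q[j] = j, i
            j = previous_partner
    return match_p, sum(w[i][match_p[i]] for i in range(n))

# Hypothetical 3 x 3 weight matrix.
weights = [[4, 1, 3],
           [2, 0, 5],
           [3, 2, 2]]
match, total = min_weight_matching(weights)
```

For small N the result can be compared against an exhaustive search over all N! permutations, which is a convenient way to gain confidence in an implementation.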


A shortest path problem can be solved in polynomial time. However,
the problem can be solved much more easily and efficiently if the path
lengths are all nonnegative. In that case the problem can be solved in
$N^2$ time by Dijkstra’s method.”®

It is possible to avoid negative distances in our problem by the
method of assigning a “potential” f(v) to each vertex. Suppose that at
step k the graph has nonnegative weights, and the edges corresponding
to $M_{k-1}$ have zero weight. Then use Dijkstra’s method to find
shortest paths to all vertices (including t) from s. Define f(v) for the
vertex v as its shortest distance from s. Next, modify the weight
$w_{ij}$ of the edge from i to j to

$$w'_{ij} = w_{ij} + f(i) - f(j). \tag{84}$$

It is easily seen that this procedure leaves all weights nonnegative,
does not alter shortest paths, and gives all shortest paths weight 0.
Reversing a shortest path from s to t gives us a matching $M_k$, and the
new graph has nonnegative weights and zero weights for the matched
edges. Now $M_0$ trivially has the postulated properties. Therefore, at
every step the graph will have these properties.
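
The three claims about the modified weights follow from the shortest-distance property of f. A short derivation is sketched below in LaTeX; the path notation $s = v_0, v_1, \ldots, v_n = v$ is introduced only for this illustration.

```latex
% f(v) is the shortest distance from s to v, so for every edge (i, j):
f(j) \le f(i) + w_{ij}
\quad\Longrightarrow\quad
w'_{ij} = w_{ij} + f(i) - f(j) \ge 0 .

% Along any path s = v_0, v_1, \dots, v_n = v the potentials telescope:
\sum_{m=0}^{n-1} w'_{v_m v_{m+1}}
  = \sum_{m=0}^{n-1} w_{v_m v_{m+1}} + f(s) - f(v)
  = \sum_{m=0}^{n-1} w_{v_m v_{m+1}} - f(v),
\qquad f(s) = 0 .
```

Every path from s to a given vertex v is therefore shortened by the same constant f(v), so the ordering of path lengths is unchanged, and a path that was shortest (of length f(v)) has modified length 0.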

An important property of the above procedure is that if a vertex p
(or q) is on some matched edge in $M_k$, then it will be on some matched
edge in $M_{k+1}$ also. In writing the actual computer program, this
property simplifies the bookkeeping considerably.

Another important property concerns the vertex potentials. Suppose
F(v) is the sum of the potentials assigned to vertex v at each of the N
steps. Suppose $w_{ij}$ and $d_{ij}$ are the original and final weights,
respectively, of the edge connecting vertex i to vertex j. Then

$$d_{ij} = w_{ij} + F(i) - F(j) \ge 0, \tag{85}$$

and $d_{ij} = 0$ for every edge in the final N-match. The numbers F(v)
thus provide a simple proof that the final match is indeed an optimum
match.

Fig. 9. Types of paths in the symmetric difference between $M_k$ and
$N_{k+1}$. There must be exactly one more path of type (iv) than of type
(iii).

We turn now to a proof of the key property mentioned above. Let
$M_k$ be an optimum k-match and let $N_{k+1}$ be any optimum
(k + 1)-match. Then we will show that there is a (k + 1)-match obtained
from an augmenting path of $M_k$ such that its weight is equal to that
of $N_{k+1}$. For this purpose define S as the set of edges in the
symmetric difference of $M_k$ and $N_{k+1}$. (Recall that the symmetric
difference of two sets, A and B, is the set of elements that belong
either to A or to B but not to both.)

From the geometry of a bipartite graph three properties of the set
S are obvious:

(i) The number of edges in S that belong to $N_{k+1}$ must exceed
the number that belong to $M_k$ by exactly 1.

(ii) The edges on any path in S must alternate between $M_k$ and
$N_{k+1}$.

(iii) A vertex on an edge in S cannot be shared with any edge in $M_k$
or $N_{k+1}$ that does not belong to S.

From the first two of these properties it follows that there can be four
types of paths in S (see Fig. 9).

(i) Circuits, each consisting of an even number of edges;

(ii) Open paths, each with an even number of edges;

(iii) Open paths, each with an odd number of edges beginning with
an edge in $M_k$;

(iv) Open paths, each with an odd number of edges beginning with
an edge in $N_{k+1}$. The number of such paths must be exactly one more
than the number of paths of type (iii).

It is easily seen that the incremental weight of every path of type (i)
or (ii) must be exactly zero. (If it is negative then the weight of $M_k$
can be decreased by a complementary labeling of the path; if positive
then the weight of $N_{k+1}$ can be decreased in the same way. But this
contradicts the hypothesis that $M_k$ and $N_{k+1}$ are optimum matches.)

In view of this and property (iii), $N_{k+1}$ can be modified by
replacing its edges on all paths of types (i) and (ii) by the
corresponding edges of $M_k$. The modified (k + 1)-match has exactly the
same weight as that of $N_{k+1}$; however, the symmetric difference
between $M_k$ and the modified (k + 1)-match has no paths of type (i) or
(ii).

The same argument applies to pairs of paths, one path each of types
(iii) and (iv). Thus a final modified (k + 1)-match is obtained whose
symmetric difference with $M_k$ is exactly one path of type (iv). The
weight of this final modified (k + 1)-match is exactly the same as that
of the original $N_{k+1}$, and the match itself is obtained from an
augmenting path of $M_k$.

We have written a subroutine that implements the above procedure.
The timing, from a number of test runs, is approximately $0.063N^3$ ms
central processing unit (CPU) time on the MV-8000. Thus for N = 40
about 4 seconds of CPU time is needed.


In this paper we present an approach to speaker-independent, 
isolated word recognition in which the well-known techniques of 
vector quantization and hidden Markov modeling are combined with 
a linear predictive coding analysis front end. This is done in the 
framework of a standard statistical pattern recognition model. Both 
the vector quantizer and the hidden Markov models need to be 
trained for the vocabulary being recognized. Such training results in 
a distinct hidden Markov model for each word of the vocabulary. 
Classification consists of computing the probability of generating the 
test word with each word model and choosing the word model that 
gives the highest probability. There are several factors, in both the 
vector quantizer and the hidden Markov modeling, that affect the 
performance of the overall word recognition system, including the 
size of the vector quantizer, the structure of the hidden Markov 
model, the ways of handling insufficient training data, etc. The 
effects, on recognition accuracy, of many of these factors are discussed 
in this paper. The entire recognizer (training and testing) has been 
evaluated on a 10-word digits vocabulary. For training, a set of 100 
talkers spoke each of the digits one time. For testing, an independent 
set of 100 tokens of each of the digits was obtained. The overall 
recognition accuracy was found to be 96.5 percent for the 100-talker 
test set. These results are comparable to those obtained in earlier 
work, using a dynamic time-warping recognition algorithm with 
multiple templates per digit. It is also shown that the computation 
and storage requirements of the new recognizer were an order of 
magnitude less than that required for a conventional pattern recog- 
nition system using linear prediction with dynamic time warping. 




I. INTRODUCTION


There currently exist two standard approaches to isolated word 
recognition, namely, feature extraction methods and statistical pattern 
recognition models. A statistical pattern recognition approach has the 
property of being a nonparametric approach to recognition and there- 
fore is widely used in most commercial and industrial recognizers.’° 
The feature-based approach to recognition has been primarily used in 
the (computationally) less expensive systems, and as a basis for rec- 
ognition of continuous speech (in conjunction with segmentation and 
labeling algorithms).*® 

In the past few years a new approach to speech processing has been 
proposed, namely, using probabilistic functions of Markov models. 
This approach has been applied at the Institute for Defense Analyses 
for speaker recognition,’° and at Carnegie Mellon University and IBM 
to solve problems in continuous speech recognition’»” with good 
success. Based on its success in these related areas of speech process- 
ing, a question that arises naturally is how well these probabilistic 
models would work on problems in isolated word recognition. 

It is the prime purpose of this paper to provide an answer to the 
question posed above. Before discussing the approach we have taken 
to get at the answer, we must first describe the structure of a word 
recognition system based on (hidden) Markov models (HMM). As in 
most recognition systems we assume we have a labeled training set of 
data from which we build a series of Markov models, one for each 
vocabulary word. Then when we want to recognize an unknown token, 
we compute a probability score for each word HMM on that token, 
and choose as the recognized word the one corresponding to the model 
with the highest probability score (i.e., the most likely word HMM). 
Techniques for training and scoring such HMMs are discussed both 
here and in the companion paper.* 

In a conventional pattern recognition system the unknown test 
token is time-aligned in turn to each reference pattern via some form 
of time-warping procedure, typically, dynamic time warping (DTW). 
By contrast, no such direct alignment is performed in the HMM 
system; only an indirect time alignment is obtained based on the 
probabilistic scoring. Thus it is interesting to study the relationship 
between probabilistic scoring and DTW as applied to isolated word 
recognition. As we shall see, there is no simple relationship. We will 
point out several similarities and differences in the two approaches. 

The organization of this paper is as follows. In Section II we briefly 
review the conventional DTW word recognizer based on LPC model- 
ing, since this will be the focus of comparison throughout the paper. In 
Section III we review the basic ideas behind the use of HMMs for 
isolated word recognition. It is the purpose of this section to establish 
notation and terminology that will define the basic parameters of 
interest in the HMM system. Section III shows that one inherent 
feature of the HMM recognizer (as we have implemented it) is that 
the models need a discrete, finite set of observations (input data) to 
obtain the best model parameters for each word in the vocabulary. A 
vector quantizer (VQ) was used to transform the continuous set of 
linear predictive coefficient (LPC) vectors into a finite observation set. 
Therefore, in Section IV we describe the key ideas behind vector 
quantization of LPC sets, and discuss the particular implementation 
that we used. In Section V we describe the overall structure of the 
HMM isolated word recognizer. In Section VI we describe a series of 
experiments used to evaluate the performance of the HMM word 
recognizer and compare it to the performance of a DTW recognizer on 
the same vocabulary. The effects, on performance, of several parameter 
variations in the HMM and VQ are also described in this section. In 
Section VII we discuss the results of the performance evaluation and 
comparison experiments. The strengths and weaknesses of the HMM 
word recognizer are discussed, along with computational and storage 
comparisons of HMM and DTW word recognizers. We attempt, in this 
section, to determine the fundamental relationships between the HMM 
and DTW systems. 


II. REVIEW OF CONVENTIONAL DTW WORD RECOGNIZER BASED ON 
LPC MODELING 


Figure 1 shows a block diagram of the LPC-based isolated word
recognizer.”* The input speech signal, s(n), recorded over a standard
dialed-up telephone line, is bandpass-filtered between 100 and 3200
Hz, and digitized at a 6.67-kHz rate. The first step in the processing is
the preprocessing block, a first-order digital network, which provides
a high-frequency pre-emphasis to the speech. The pre-emphasized
signal is blocked into frames of 45 ms (300 samples) with each
consecutive frame spaced 15 ms (100 samples) apart. An 8-pole LPC
analysis (autocorrelation method) is performed on each frame of the word
(after isolating it with an endpoint detector”), thus creating the test
pattern. This test pattern is compared with each reference pattern
using a DTW alignment algorithm that simultaneously provides a
distance score associated with the alignment. The distance scores for
all the reference patterns are sent to a decision rule, which provides a
classification of the spoken word, and possibly an ordered (by distance)
set of the best n candidates.

Fig. 1. Block diagram of conventional LPC-based word recognizer using a
standard dynamic time-warping algorithm for registering test and
reference patterns.
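
The pre-emphasis and frame-blocking steps can be sketched as follows. This is an illustrative NumPy fragment, not the authors' front end: the pre-emphasis coefficient of 0.95 is an assumed value (the text says only that the network is first order), the function names are hypothetical, and the sine wave stands in for a recorded word. The frame length and spacing are the 300 and 100 samples quoted above.

```python
import numpy as np

FRAME_LEN = 300      # 45 ms at the 6.67-kHz sampling rate
FRAME_SHIFT = 100    # 15 ms at the 6.67-kHz sampling rate

def preemphasize(signal, coeff=0.95):
    """First-order pre-emphasis y[n] = x[n] - coeff * x[n-1].

    coeff = 0.95 is an assumption, not a value taken from the text.
    """
    signal = np.asarray(signal, dtype=float)
    out = signal.copy()
    out[1:] -= coeff * signal[:-1]
    return out

def block_into_frames(signal, frame_len=FRAME_LEN, frame_shift=FRAME_SHIFT):
    """Return an array with one overlapping frame per row."""
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    starts = frame_shift * np.arange(n_frames)
    return signal[starts[:, None] + np.arange(frame_len)[None, :]]

# Synthetic stand-in for s(n): 1000 samples of a 440-Hz tone.
speech = np.sin(2 * np.pi * 440 * np.arange(1000) / 6670.0)
frames = block_into_frames(preemphasize(speech))
```

Each row of `frames` would then be passed to the LPC analysis; consecutive rows share 200 samples.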

The word reference patterns for the recognizer of Fig. 1 are created 
by a training algorithm. For speaker-trained applications, typically a 
single reference pattern is created for each word in the vocabulary 
using a robust training algorithm.” For speaker-independent applica- 
tions, a set of Q reference patterns is created for each vocabulary word 
using a clustering procedure.’®'’ Typically, about 12 templates per 
word are sufficient for recognizing words from a fairly homogeneous 
adult population of native American talkers. 


III. BASICS OF HMM FOR WORD RECOGNITION

We assume we have a finite sequence, O, of observations,

$$O = O_1 O_2 \cdots O_T, \tag{1}$$

where each observation is a discrete symbol drawn from a finite
alphabet of symbols. (For the system we will be describing, the
observations are the indices of the LPC vectors obtained from an LPC
vector quantizer.) We further assume that the sequence of observations
may be modeled as a probabilistic function of an underlying Markov
chain whose state transitions are not directly observable; hence the
name “Hidden Markov Model.” Figure 2 shows such a model, M,
which is characterized by the following:

(i) N = the number of states in the model. For the model of Fig. 2,
N = 5.

(ii) M = the number of output symbols in the discrete alphabet of
the model. For the present example, M = 5.

(iii) $A = \{a_{ij}\}$, the transition matrix of the underlying Markov
chain. Here, $a_{ij}$ is the probability of making a transition to state
j, given that the model is in state i. For the model of Fig. 2 we have

$$A = \begin{bmatrix}
0.8 & 0.1 & 0.1 & 0 & 0 \\
0 & 0.8 & 0.2 & 0 & 0 \\
0 & 0 & 0.8 & 0.1 & 0.1 \\
0 & 0 & 0 & 0.8 & 0.2 \\
0 & 0 & 0 & 0 & 1.0
\end{bmatrix}$$

Note that only 11 of the 25 $a_{ij}$’s are nonzero.

(iv) $B = \{b_{jk}\} = \{b_j(k)\}$, the model output symbol probability
matrix, where $b_j(k)$ is the probability of outputting symbol k, given
that the model is in state j. For the example chosen,

$$B = \begin{bmatrix}
0.5 & 0.5 & 0 & 0 & 0 \\
0 & 0.5 & 0.5 & 0 & 0 \\
0 & 0 & 0.5 & 0 & 0.5 \\
0.5 & 0 & 0 & 0.5 & 0 \\
0 & 0 & 0 & 0.5 & 0.5
\end{bmatrix}$$

(v) $\pi = \{\pi_i\}$, i = 1, 2, ..., N, the initial state probability
vector. For the left-to-right models of the type shown in Fig. 2, we
assume the system always begins in state 1, i.e., $\pi_1 = 1$,
$\pi_i = 0$, $i \ne 1$.

Fig. 2. A typical state diagram for a 5-state Markov model.

Isolated word recognition using HMM consists of two phases, train- 
ing and recognition (or classification). In the training phase, the 
training set of observations is used to derive a set of reference models 
of the above type, one for each word in the vocabulary. In the 
classification phase, the probability of generating the test observation 
is computed for each reference model. The test is classified as the word 
whose model gives the highest probability. The computations in each 
of these phases are fairly straightforward. 

Let us begin with the classification phase. Given the observation
sequence, O, and a model, M (i.e., N, M, A, B, and $\pi$), the
probability of O having been generated by model M is

$$P(O|M) = \sum_{i_1, i_2, \ldots, i_T} \pi_{i_1} b_{i_1}(O_1)\, a_{i_1 i_2} \cdots a_{i_{T-1} i_T} b_{i_T}(O_T). \tag{2}$$

The summation in eq. (2) is more readily computed by defining a
forward partial probability, $\alpha_t(i)$, as

$$\alpha_t(i) = P(O_1 O_2 \cdots O_t \text{ and state } i \text{ at time } t \,|\, M). \tag{3}$$

This leads to the recursion

$$\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i)\, a_{ij} \right] b_j(O_{t+1}), \qquad t = 1, 2, \ldots, T-1, \tag{4}$$

by which eq. (2) can be expressed as

$$P(O|M) = P = \sum_{j=1}^{N} \alpha_T(j). \tag{5}$$
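
The saving offered by the forward recursion can be seen by computing P both ways. The sketch below is illustrative only. It assumes 0-based state and symbol indices and starts the recursion from $\alpha_1(i) = \pi_i b_i(O_1)$, which follows from eq. (3) at t = 1. The function names are hypothetical. The matrices are meant to be the 5-state example of this section; the last row of B is an assumed reading of a damaged line in the source, and the observation sequence is made up.

```python
import itertools
import numpy as np

A = np.array([[0.8, 0.1, 0.1, 0.0, 0.0],
              [0.0, 0.8, 0.2, 0.0, 0.0],
              [0.0, 0.0, 0.8, 0.1, 0.1],
              [0.0, 0.0, 0.0, 0.8, 0.2],
              [0.0, 0.0, 0.0, 0.0, 1.0]])
B = np.array([[0.5, 0.5, 0.0, 0.0, 0.0],
              [0.0, 0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.0, 0.5],
              [0.5, 0.0, 0.0, 0.5, 0.0],
              [0.0, 0.0, 0.0, 0.5, 0.5]])   # last row is an assumed reading
pi = np.array([1.0, 0.0, 0.0, 0.0, 0.0])

def forward_probability(A, B, pi, obs):
    """P(O|M) by the recursion of eq. (4) and the sum of eq. (5)."""
    alpha = pi * B[:, obs[0]]                 # alpha_1(i) = pi_i b_i(O_1)
    for symbol in obs[1:]:
        alpha = (alpha @ A) * B[:, symbol]    # eq. (4)
    return alpha.sum()                        # eq. (5)

def brute_force_probability(A, B, pi, obs):
    """P(O|M) by summing eq. (2) over all N**T state sequences."""
    total = 0.0
    for states in itertools.product(range(len(pi)), repeat=len(obs)):
        p = pi[states[0]] * B[states[0], obs[0]]
        for t in range(1, len(obs)):
            p *= A[states[t - 1], states[t]] * B[states[t], obs[t]]
        total += p
    return total

obs = [0, 1, 2, 4, 3]                         # made-up observation sequence
p_forward = forward_probability(A, B, pi, obs)
p_brute = brute_force_probability(A, B, pi, obs)
```

The direct sum visits $N^T$ state sequences (3125 here), whereas the recursion needs on the order of $N^2 T$ multiplications.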


In the training phase an initial estimate of the model is made, and 
P is computed for the training observation sequence according to eq. 
(5). Next the model is iteratively adjusted to increase P. The iterations 
are stopped when P stops increasing significantly, or when some other 
stopping criterion is met (e.g., when the number of iterations exceeds 
a limit). 

One remarkable algorithm for improving a trial model is the Baum- 
Welch reestimation algorithm.’® However, maximizing P can also be 
looked upon as a constrained optimization problem, for which many 
algorithms have been proposed. In the companion paper in this issue 
of the Journal,’ we discuss the relative merits of these procedures. 

We now discuss a number of factors that influence the performance 
of HMM recognizers. 


3.1 Initial estimates of A and B 


One factor of interest for the HMM recognizer is the choice of initial 
estimates for the elements of the matrices A and B. The problem here 
is that although the training procedure is guaranteed to reach a critical 
point of P, the value of P obtained is typically a local maximum. 
Hence, alternative starting values of A and B could yield models with 
higher (or lower) values of P. For our simulations we have chosen to 
start the training models with essentially random choices for the 
nonzero elements of both A and B, normalized to satisfy the constraints 

$$\sum_{j=1}^{N} a_{ij} = 1, \qquad i = 1, 2, \ldots, N \tag{6a}$$

$$\sum_{k=1}^{M} b_j(k) = 1, \qquad j = 1, 2, \ldots, N. \tag{6b}$$

An alternative starting condition could be

$$a_{ij} = 1/N + \epsilon \tag{7a}$$

$$b_j(k) = 1/M + \epsilon, \tag{7b}$$

where $\epsilon$ is a uniformly distributed random variable whose peak is
much smaller than either 1/N or 1/M. [Again the $a_{ij}$’s and
$b_j(k)$’s of eq. (7) must be normalized using eq. (6) prior to running
the optimization.]

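
Both starting conditions are easy to generate. The following sketch is one possible reading, with several assumptions: the function names are hypothetical, $\epsilon$ is drawn uniformly from a symmetric interval [-peak, peak] (the text does not give the interval), and the 5-state, 64-symbol sizes are chosen only for the example.

```python
import numpy as np

def normalize_rows(X):
    """Scale each row to sum to 1, as required by eq. (6)."""
    return X / X.sum(axis=1, keepdims=True)

def random_start(allowed, M, rng):
    """Random A and B; A is nonzero only where `allowed` is True.

    `allowed` is a boolean N x N array marking the transitions the
    model structure permits. Each row must allow at least one.
    """
    N = allowed.shape[0]
    A = normalize_rows(rng.random((N, N)) * allowed)
    B = normalize_rows(rng.random((N, M)))
    return A, B

def near_uniform_start(N, M, rng, peak=1e-3):
    """Starting values of eq. (7); peak must be well below 1/N and 1/M."""
    A = normalize_rows(1.0 / N + rng.uniform(-peak, peak, size=(N, N)))
    B = normalize_rows(1.0 / M + rng.uniform(-peak, peak, size=(N, M)))
    return A, B

rng = np.random.default_rng(0)
allowed = np.triu(np.ones((5, 5), dtype=bool))   # upper-triangular structure
A0, B0 = random_start(allowed, M=64, rng=rng)
A1, B1 = near_uniform_start(N=5, M=64, rng=rng)
```
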
3.2 HMM structures and the number of states 


A second factor affecting the determination of optimum HMMs for
each vocabulary word is the model structure and the number of states.
We have considered three types of model structures, namely
unconstrained, constrained serial, and constrained parallel. Typical
examples of each of these models are shown in Fig. 3. In the
unconstrained models (shown in Fig. 3a) a transition from any state to
any other state can be made, i.e., all $a_{ij}$’s are allowed to be
nonzero. Both the constrained serial models (shown in Fig. 3b) and the
constrained parallel models (shown in Fig. 3c) are left-to-right models,
i.e., the state transition matrix A is upper triangular. The serial
models generally proceed sequentially through the states (although
individual states can be skipped over), whereas the parallel models
allow multiple paths through the model, with each path skipping one or
more model states. For example, there are four distinct paths through
the model of Fig. 3c, 1-2-4-6, 1-2-5-6, 1-3-4-6 and 1-3-5-6, each of
which traverses four of the six model states.

Fig. 3. State diagrams for: (a) unconstrained Markov model with four
states, (b) constrained serial Markov model with four states, and
(c) constrained parallel Markov model with six states.

Each of the model structures of Fig. 3 can be generalized to include 
an arbitrary number of states. Recall, however, that the number of 
free parameters of the Markov model is on the order of $N^2$ (for the A 
matrix) plus NM (for the B matrix). Hence, if N gets too large, accurate 
and reliable determination of the optimum A’s and B’s may become 
difficult for a fixed-size training set. However, within these constraints 
we have investigated models with as few as two states, and as many as 
20 states. There appears to be no good theoretical way to choose the 
number of states needed for a word model, since the states need not be 
physically related to any single observable phenomenon. 


3.3 Multiple observation sequences 


A third factor affecting the determination of the optimum HMM for
each vocabulary word is the observation sequence used for training.
Since we are interested in obtaining speaker-independent models, the
observation sequence, O, actually consists of several independent
sequences $O^{(k)}$, k = 1, 2, ..., K, where $O^{(k)}$ is the training
sequence for talker k, and K is the total number of talkers used for
training. Typically, a value of K = 100 has been used in our clustering
work for speaker-independent training. The way in which we handle
multiple sequences is to calculate $P(O^{(k)}|M)$, using eq. (5), for
each sequence, and maximize the product of the probabilities, i.e.,

$$P = \prod_{k=1}^{K} P(O^{(k)}|M). \tag{8}$$

The implementation of the computation of eq. (8) is straightforward
for the Baum-Welch reestimation procedure, as well as for the gradient
methods.”* Thus the fact that the training data consist of multiple
sequences causes no problem in estimating the optimum HMM parameters.
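
A product of K = 100 small probabilities can fall below the range of floating-point arithmetic, so one common way to evaluate eq. (8) is to accumulate logarithms. The sketch below takes that route as an assumption; the text does not say how the product was evaluated. The function names, the 2-state model, and the three short sequences are hypothetical, and symbols are 0-based indices.

```python
import numpy as np

def forward_probability(A, B, pi, obs):
    """P(O|M) from eqs. (4) and (5)."""
    alpha = pi * B[:, obs[0]]
    for symbol in obs[1:]:
        alpha = (alpha @ A) * B[:, symbol]
    return alpha.sum()

def log_joint_probability(A, B, pi, sequences):
    """Logarithm of eq. (8): the sum over talkers of log P(O^(k)|M)."""
    return sum(np.log(forward_probability(A, B, pi, obs)) for obs in sequences)

# Hypothetical 2-state, 3-symbol left-to-right model and training set.
A = np.array([[0.7, 0.3],
              [0.0, 1.0]])
B = np.array([[0.6, 0.3, 0.1],
              [0.1, 0.2, 0.7]])
pi = np.array([1.0, 0.0])
sequences = [[0, 1, 2], [0, 2], [1, 1, 2, 2]]
log_P = log_joint_probability(A, B, pi, sequences)
```

Maximizing `log_P` is equivalent to maximizing P, because the logarithm is monotonic.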


3.4 Constraints on A, B matrices during training 


As we show in Fig. 3 we have considered three general HMM 
structures. For the unconstrained structure the A and B matrix ele- 
ments are allowed to assume any value consistent with the stochastic- 
ity constraints. For the constrained serial models we have considered 
two general constraints, namely: 


$$\text{SC1: } a_{ij} = 0 \text{ for } j < i \text{ and } j \ge i + 3 \quad \text{(double skip allowed)} \tag{9a}$$

$$\text{SC2: } a_{ij} = 0 \text{ for } j < i \text{ and } j \ge i + 2 \quad \text{(single skip allowed)}. \tag{9b}$$

These two cases are illustrated in Fig. 4 for a 5-state model.
Constraints SC1 allow single- or double-state jumps when exiting a given
state, whereas constraints SC2 only allow single jumps when exiting a
state. Hence, models of the type shown in Fig. 4a have somewhat more
flexibility than those of the type shown in Fig. 4b.

Fig. 4. Markov models for two types of serial constraints: (a) single
and double transitions allowed (SC1 model), and (b) only single
transitions allowed (SC2 model).
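
The two serial constraints amount to boolean masks on the transition matrix. A minimal sketch, assuming 0-based state indices and a hypothetical helper `serial_mask`:

```python
import numpy as np

def serial_mask(N, max_jump):
    """True where a_ij may be nonzero: i <= j <= i + max_jump."""
    i, j = np.indices((N, N))
    return (j >= i) & (j <= i + max_jump)

SC1 = serial_mask(5, max_jump=2)   # eq. (9a): jumps of one or two states
SC2 = serial_mask(5, max_jump=1)   # eq. (9b): jumps of one state only
```

Such a mask could be multiplied into the initial estimate of A before the rows are normalized, so that the forbidden transitions start at zero.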

For the constrained parallel models we have only considered tran- 
sition matrices of the type illustrated in Fig. 3c, i.e., a given state can 
only exit to either of a pair of states in the next column of the grid. 

For the most part we have not constrained the B matrix. However,
one problem arises if the B matrix is left completely unconstrained.
The problem is that a finite training sequence of length T may result
in $b_j(k) = 0$. In classification, if $\alpha_{t-1}(i)\, a_{ij}$ is
nonzero for only one value of j, and $O_t = k$, then the probability of
that sequence arising from the model with $b_j(k) = 0$ is P = 0; hence a
recognition error must occur. This is the so-called missing or
inadequate training data problem. We handle this problem (see Ref. 13
for a justification) by using post-estimation constraints on the
$b_j(k)$’s of the form

$$b_j(k) \ge \epsilon, \tag{10}$$

where $\epsilon$ is a suitably chosen threshold value. All $b_j(k)$’s
are compared to the $\epsilon$ threshold, and those that are below
$\epsilon$ are replaced by $\epsilon$, for each j. After this
replacement, each $b_j(k)$ that was not changed to the $\epsilon$ value
is rescaled by the quantity $1 - R_j \epsilon$ [where $R_j$ is the
number of $b_j(k)$’s changed for a given j] to properly normalize the
$b_j(k)$’s.
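
The post-estimation constraint can be sketched as below. The function name and the value $\epsilon = 10^{-3}$ are hypothetical. One deliberate departure is labeled in the code: the unchanged entries are scaled by $(1 - R_j \epsilon)$ divided by their current sum, which makes each row sum to exactly 1 and reduces to the factor $1 - R_j \epsilon$ described in the text when the replaced entries were exactly zero.

```python
import numpy as np

def floor_symbol_probabilities(B, eps):
    """Enforce b_j(k) >= eps (eq. 10) and renormalize each row of B."""
    B = np.array(B, dtype=float)
    for j in range(B.shape[0]):
        low = B[j] < eps
        R_j = int(low.sum())                 # number of entries replaced
        kept_mass = B[j, ~low].sum()         # equals 1 if replaced entries were 0
        B[j, low] = eps
        # Assumed variant: dividing by kept_mass keeps the row sum at exactly 1.
        B[j, ~low] *= (1.0 - R_j * eps) / kept_mass
    return B

B = np.array([[0.70, 0.30, 0.00, 0.00],
              [0.25, 0.25, 0.25, 0.25]])
B_floored = floor_symbol_probabilities(B, eps=1e-3)
```

An entry only marginally above $\epsilon$ can end up marginally below it after rescaling; a second pass would handle that case if it matters.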

It should be clear that all the constrained HMMs are left-right 
models in that the observations must begin in state 1, must proceed 
from state to state in a monotonically increasing manner, and must 
end in state N. Thus, temporal information in the observation sequence 
is coded directly into the left-to-right HMM. 


3.5 Multiple estimates of A, B, and averaging 


As we mentioned earlier, the reestimation and gradient optimization 
procedures are guaranteed to find a critical point of P for each HMM. 
However, in practice, a large number of such points exist in the 
parameter space. Thus, different initial conditions on A and B may 
lead to different solutions. To understand the variability in model 
parameters as well as its effects on overall recognition performance, a 
series of HMMs were obtained for each word by selecting R random 
starting sets for A and B, and solving for the optimum A and B in each 
case. By scoring each of the R models individually, we can obtain an 
indication of the statistical variability in performance score owing to 
uncertainty in A and B. 

An alternative procedure to using multiple HMMs for each word, 
obtained from different random starting values for A and B, is to 
average the R sets of A and B to give an averaged model for each 
word. The effects of such averaging on word recognition accuracy will 
be discussed in Section 6.3. 
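
As a rough illustration of the averaging alternative, the Python sketch 
below forms an unweighted, element-by-element mean of R parameter sets. 
It assumes that all R models share the same dimensions and that the 
average is unweighted; neither detail is specified above, and the name 
`average_models` is hypothetical. 

```python
def average_models(models):
    """Average R sets of (A, B) matrices element by element.

    models: list of (A, B) pairs, where A and B are lists of rows and
    every model has the same dimensions (an assumption of this sketch).
    """
    r = len(models)

    def mean_matrix(mats):
        n_rows, n_cols = len(mats[0]), len(mats[0][0])
        return [
            [sum(m[i][j] for m in mats) / r for j in range(n_cols)]
            for i in range(n_rows)
        ]

    a_avg = mean_matrix([a for a, _ in models])
    b_avg = mean_matrix([b for _, b in models])
    return a_avg, b_avg


# Two toy 2-state, 2-symbol models obtained from different starts.
model_1 = ([[0.6, 0.4], [0.0, 1.0]], [[0.9, 0.1], [0.2, 0.8]])
model_2 = ([[0.4, 0.6], [0.0, 1.0]], [[0.7, 0.3], [0.4, 0.6]])
A_avg, B_avg = average_models([model_1, model_2])
```

Because each row of every A and B sums to one, the rows of the averaged 
matrices also sum to one, and transitions that are zero in every model 
remain zero. 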


3.6 Scoring of observation sequences 


One way to score a given observation sequence, O, is to use the 
iterative calculation of eqs. (4) and (5). We call this the Baum-Welch 
score, $P_{BW}$. For left-to-right models, eq. (5) is modified as 


$$P_{BW} = \alpha_T(N) \qquad (11)$$


because the sequence is constrained to end in state N. 

An alternative scoring procedure for the observation sequence, O, 
given the model, M, is the Viterbi algorithm, which may be compactly 
stated as: 

(i) Initialization: $\delta_1(i) = \log[\pi_i b_i(O_1)]$, $i = 1, 2, \ldots, N$ 

(ii) Recursion: for $2 \le t \le T$, $1 \le j \le N$ 


$$\delta_t(j) = \max_{1 \le i \le N} \{\delta_{t-1}(i) + \log[a_{ij} b_j(O_t)]\}$$


(iii) Termination: 


$$P_{VI} = \delta_T(N) \quad \text{for left-to-right models}$$
$$P_{VI} = \max_{1 \le j \le N} \delta_T(j) \quad \text{for unconstrained models.}$$


The above algorithm is a form of the well-known dynamic programming 
method and can be shown to have the property of determining the state 
sequence $i = i_1 i_2 \cdots i_T$, which maximizes 


$$P(i \mid O, M).$$


It is easily shown that both the Baum-Welch and Viterbi scoring 
procedures require roughly the same amount of computation. The 
major differences are in the interpretation of the resulting solutions. 
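
The two scoring procedures can be sketched in Python as follows. This 
is an illustrative sketch, not the implementation used in the 
experiments. Equations (4) and (5) are not reproduced in this section, 
so the forward recursion below is the standard one and is an 
assumption; the function names, the 0-based state indexing, and the toy 
model are likewise hypothetical. 

```python
import math

NEG_INF = float("-inf")


def safe_log(x):
    """log(x), with log(0) taken as minus infinity."""
    return math.log(x) if x > 0.0 else NEG_INF


def viterbi_score(pi, A, B, obs, left_to_right=True):
    """Log-probability of the best state path (the Viterbi score)."""
    n = len(pi)
    delta = [safe_log(pi[i] * B[i][obs[0]]) for i in range(n)]
    for o in obs[1:]:
        delta = [
            max(delta[i] + safe_log(A[i][j] * B[j][o]) for i in range(n))
            for j in range(n)
        ]
    # Left-to-right models must end in the last state, N.
    return delta[n - 1] if left_to_right else max(delta)


def forward_score(pi, A, B, obs, left_to_right=True):
    """Probability summed over all state paths (standard forward pass).

    No scaling is applied, so long sequences may underflow in practice.
    """
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
            for j in range(n)
        ]
    return alpha[n - 1] if left_to_right else sum(alpha)


# Toy 2-state left-to-right model with a 2-symbol alphabet.
pi = [1.0, 0.0]
A = [[0.5, 0.5], [0.0, 1.0]]
B = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 0, 1]
p_bw = forward_score(pi, A, B, obs)   # sums over both admissible paths
p_vi = viterbi_score(pi, A, B, obs)   # log-probability of the best path
```

For this toy sequence two paths end in the final state; the forward 
score adds their probabilities, whereas the Viterbi score keeps only 
the larger one, which illustrates the difference in interpretation 
noted above. 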


IV. VECTOR QUANTIZATION OF LPC COEFFICIENTS 


In Section III we noted that in our implementation of HMMs for 
isolated word recognition the inputs to the model are assumed to be 
sequences of discrete symbols chosen from a finite alphabet. We 
obtained these discrete symbols by using the method of vector 
quantization of the LPC vectors measured as described in Section II. In 
this section we review the theory of vector quantization and discuss its 
implementation for isolated word recognition. 


4.1 Theory of vector quantization 


Assume we have a training set of LPC vectors, $a_i$, $i = 1, 2, \ldots, I$, 
which are a good representation of the types of LPC vectors that occur 
when the words in the vocabulary are pronounced by a wide range of 
talkers. The main idea behind vector quantization is to determine the 
optimum set of codebook LPC vectors, $\hat{a}_m$, $m = 1, 2, \ldots, M$, 
such that for a given M, the average distortion in replacing each of 
the training set vectors, $a_i$, by the closest codebook entry, 
$\hat{a}_m$, is minimum. 

More formally stated, if we define $d(a_R, a_T)$ as the distance between 
two LPC vectors, $a_R$ and $a_T$, then the goal of vector quantization 
is to find the set, $\hat{a}_m$, such that 


$$\|D_M\| = \min_{\hat{a}_m} \left\{ \frac{1}{I} \sum_{i=1}^{I} \min_{1 \le m \le M} \left[ d(\hat{a}_m, a_i) \right] \right\} \qquad (12)$$


is satisfied. The quantity $\|D_M\|$ is the average distortion 
(distance) of the vector quantizer. 

The way in which eq. (12) is solved, for a given value of M, is due to 
Juang et al. The algorithm first finds the optimum solution for M = 2 
(two codebook entries), then splits each optimum LPC vector into 
two components, and finds the optimum solution for a codebook of twice 
the size ($M \to 2M$). This procedure iterates until M is as large as 
desired. A flow diagram of the details of the codebook generation 
procedure is given in Ref. 21. The local distance used in our system 
was the likelihood distance, 


$$d(a_R, a_T) = \frac{a_R V_T a_R'}{a_T V_T a_T'} - 1, \qquad (13)$$


where $V_T$ is the autocorrelation matrix of the sequence that gave rise 
to LPC vector $a_T$. 
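
To make eqs. (12) and (13) concrete, the following Python sketch 
evaluates the likelihood distance and the average distortion of a 
given codebook. It is only a sketch under stated assumptions: each LPC 
vector is taken to be the coefficient vector $(1, a_1, \ldots, a_p)$, 
each training frame is stored together with its own autocorrelation 
matrix, and the names `likelihood_distance` and `average_distortion` 
are hypothetical. The codebook search itself (the splitting procedure) 
is not shown. 

```python
def quad_form(a, V):
    """Quadratic form a V a' for a row vector a and square matrix V."""
    n = len(a)
    return sum(a[i] * V[i][j] * a[j] for i in range(n) for j in range(n))


def likelihood_distance(a_ref, a_test, V_test):
    """Likelihood distance of eq. (13); V_test belongs to the test frame."""
    return quad_form(a_ref, V_test) / quad_form(a_test, V_test) - 1.0


def average_distortion(codebook, frames):
    """Average distortion of eq. (12) for a fixed codebook.

    frames: list of (a_i, V_i) pairs, one per training vector (assumed
    storage layout).  Each frame is charged the distance to its closest
    codebook entry.
    """
    total = 0.0
    for a_i, V_i in frames:
        total += min(likelihood_distance(a_m, a_i, V_i) for a_m in codebook)
    return total / len(frames)


# Toy first-order example: normalized autocorrelation r0 = 1, r1 = 0.5,
# for which the optimum predictor coefficient vector is (1, -0.5).
V = [[1.0, 0.5], [0.5, 1.0]]
a_test = [1.0, -0.5]
d_self = likelihood_distance(a_test, a_test, V)       # zero by construction
d_other = likelihood_distance([1.0, 0.0], a_test, V)  # 1/0.75 - 1
```

The distance is zero when the reference equals the test vector and 
positive otherwise, since the test vector minimizes the quadratic form 
for its own autocorrelation matrix. 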


4.2 Implementation of the vector quantizer 


To train the vector quantizer we used a set of 39708 LPC vectors, 
obtained by using all the vectors in one complete set of 10 isolated 
digits uttered by each of 100 talkers (50 male, 50 female). Applying the 
algorithm of Juang et al., we generated vector quantizers of size 
M = 2, 4, 8, 16, 32, 64, and 128. During the course of running the 
algorithm, several performance criteria were monitored, including: 

(i) Average distortion, $\|D_M\|$, of eq. (12). 

(ii) Sigma ratio (cluster separation) of the resulting codebook 
entries (clusters), defined as 


$$\sigma = \frac{\dfrac{1}{M} \sum_{i=1}^{M} \dfrac{1}{M-1} \sum_{j=1,\, j \ne i}^{M} d(\hat{a}_i, \hat{a}_j)}{\|D_M\|}, \qquad (14)$$


where the numerator is the average intercluster distance, and 
$\|D_M\|$ is the average intracluster distance. 

(iii) Cluster cardinality, $N_i$, defined as the number of tokens in the 
ith cluster (i.e., the cluster represented by the ith codebook entry). 

(iv) Cluster distortion, $d_i$, defined as the average distortion 
(distance) for the ith cluster. 

It should be clear that the average distortion, $\|D_M\|$, satisfies the 
relation 


$$\|D_M\| = \frac{1}{I} \sum_{i=1}^{M} d_i N_i \qquad (15)$$


and that the cluster occupancy satisfies the relation 


$$\sum_{i=1}^{M} N_i = I. \qquad (16)$$
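
The bookkeeping of eqs. (15) and (16) can be checked with a few lines 
of Python. The sketch is illustrative only; the function name 
`overall_distortion` and the toy cluster values are hypothetical, while 
the figures 39708 and 128 are the training-set size and largest 
codebook size quoted above. 

```python
def overall_distortion(cardinalities, cluster_distortions):
    """Average distortion from per-cluster statistics, per eq. (15)."""
    total_tokens = sum(cardinalities)  # eq. (16): the N_i sum to I
    weighted = sum(n * d for n, d in zip(cardinalities, cluster_distortions))
    return weighted / total_tokens


# Toy codebook with two clusters: 2 tokens at 0.4 and 6 tokens at 0.2.
toy_average = overall_distortion([2, 6], [0.4, 0.2])

# With I = 39708 training vectors and M = 128 clusters, the mean
# cluster cardinality implied by eq. (16) is I / M.
mean_cardinality = 39708 / 128
```

The overall distortion is the cardinality-weighted mean of the cluster 
distortions, so heavily populated clusters dominate it. 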


Results of running the VQ algorithm on the training set of 39708 
vectors are given in Figs. 5 through 8. Figure 5 shows plots of 
$\|D_M\|$ versus M (on a log scale) (part a), and the $\sigma$-ratio 
versus M (part b) for values of M from 2 to 128. We can see that for 
values of $M \ge 32$ the average distortion falls below 0.3, and that 
for M = 64 the value of $\|D_M\|$ = 0.2. If we use the conventional 
recognizer of Fig. 1, the average distance between repetitions of a 
word (after DTW alignment) has been found to be on the order of 0.3 to 
0.4; hence, values of $\|D_M\| < 0.3$ imply smaller error for the VQ 
than for interreplication variations of words. The $\sigma$-ratio plot 
shows ratios greater than 10 for $M \ge 32$; hence, extremely good 
cluster separation is achieved in the vector quantizer for these values 
of M. 

Fig. 5—Plots of vector quantizer performance versus size of codebook 
(M, on a log scale) for: (a) average distortion, and (b) sigma ratio. 


Figures 6 through 8 show a detailed analysis of the statistics of the 
vector quantizer output for M = 128. Figure 6a shows the cluster 
cardinality as a function of the VQ index, and Fig. 6b a histogram of 
cluster cardinality. The largest cluster has 857 tokens, whereas the 
smallest cluster has 119 tokens; hence, a spread of over 7 to 1 in cluster 
occupancy is obtained. The average cluster cardinality, for this case, 
is 310 tokens, as denoted by the dashed line in Fig. 6a. The histogram 
of cluster cardinality indicates that the vast majority of clusters have 
fewer than the average number of tokens. 

Figure 7a shows the cluster distortion as a function of the VQ index, 
and Fig. 7b shows a histogram of cluster distortions. The largest 
distortion for any cluster is 0.303, whereas the smallest distortion is 
0.047; hence, a spread of more than 6 to 1 is observed in cluster 
distortions. The dashed lines in Fig. 7 denote the average cluster 
distortion, which in this case is 0.165. 

Fig. 6—Plots of (a) cell occupancy versus codebook index, and (b) 
histogram of cell occupancy for 128-codeword vector quantizer. 


Finally, Fig. 8 shows a plot of the total cluster distortion, defined as 
$N_i d_i$, versus VQ index. The range of total cluster distortion is 
from 25 to 72; hence, a spread of less than 3 to 1 is obtained. It is 
conjectured that the “ideal” vector quantizer seeks to determine the 
set of “optimum” codebook vectors such that the total cluster 
distortion is as close to uniform as possible. Hence, clusters with 
large average distortions should have low cardinality, whereas clusters 
with small average distortion should have high cardinality. It can be 
seen from Figs. 6 through 8 that the total cluster distortion 
statistics are much closer to uniform than are either the cardinality 
or the average cluster distortion statistics. 

Based on the results shown in Figs. 5 through 8, it was decided to 
implement the HMM recognizer using an M = 64 VQ, since the small 
decrease in average distortion from M = 64 to M = 128 did not justify 
the increased computation owing to the larger codebook. 

Fig. 7—Plots of (a) cell average distortion versus codebook index, and 
(b) histogram of cell average distortion for 128-codeword vector 
quantizer. 


Figure 9 shows some properties of the LPC vectors in the codebook 
for M = 64. Shown in this figure are plots of the first few resonances 
of the 64 codebook entries (part a), and plots of first versus second 
resonance (part b), first versus third resonance (part c), and second 
versus third resonance (part d). As we anticipated, typical vowel 
resonances for the digits are seen clearly in the plots (e.g., high front 
vowels, low back vowels, etc.), along with characteristic resonances of 
transient and other nonvoiced sounds. Detailed examination of the 
spectra of the 64 codebook entries did not provide any further 
enlightenment as to the VQ properties. 


V. OVERALL HMM/VQ ISOLATED WORD RECOGNIZER 


A block diagram of the overall HMM/VQ isolated word recognizer 
is given in Fig. 10. The recognizer operates as a speaker-independent 
word recognizer, which runs first in a training mode, to provide the 
codebook entries of the VQ, and the model coefficients of each word 
HMM. 

In the classification mode the LPC sets of the unknown word are 
first sent through the vector quantizer (to give a finite set of VQ 
indices) and then scored on each word HMM (using either the Viterbi 
scoring or the Baum-Welch scoring) to give a probability score for 
each word model. The decision rule chooses the word whose model 
gives the highest probability. 

In the next section we describe the results of several tests designed 
to measure the performance of the HMM/VQ word recognizer and to 
compare it with that of a conventional LPC/DTW recognizer. 

Fig. 8—Plot of cell total distortion versus codebook index for 
128-codeword vector quantizer. 


VI. EVALUATION EXPERIMENTS AND RESULTS 


Several evaluation tests were performed on the HMM/VQ and 
LPC/DTW isolated word recognizers. For most conditions a single test 
set of data, denoted as TS1, was used, consisting of one replication of 
each of the 10 digits by a set of 100 talkers. These talkers were the 
same ones used to train the recognizer; however, the test replication 
was recorded many days after the training replication. A second test 
set of data, denoted as TS2, was used in a couple of tests. This test set 
consisted of 20 replications of each of the 10 digits by a set of 10 new 
talkers (5 male, 5 female). Thus, TS2 contained twice as many test 
tokens as TS1, but represented only one-tenth the number of talkers; 
however, none of the talkers was included in the training set for either 
the VQ or the HMMs. 

The results presented in this section are the output of a series of 
recognition tests in which one or more features of the HMM/VQ 
recognizer were varied. Following the presentation of the results of 
each of the individual experiments, we shall endeavor to provide a 
measure of coherency to the results. 


1090 THE BELL SYSTEM TECHNICAL JOURNAL, APRIL 1983 


Fig. 9—Plots of locations of first three resonances of codebook vectors 
(frequencies in kilohertz): (a) plotted as a function of the vector 
index, (b) plotted in the $F_1$-$F_2$ plane, (c) plotted in the 
$F_1$-$F_3$ plane, and (d) plotted in the $F_2$-$F_3$ plane. 

Fig. 10—Overall block diagram of the hidden Markov model—vector-quantizer 
isolated word recognizer. 


6.1 Effects of constrained and unconstrained HMMs 


The first set of experiments sought to understand the effects of 
placing constraints on the A and B matrices on the performance of the 
overall recognizer. As such, the HMM was trained for an N = 5 state 
model with the following constraints: 

UC model: No constraints placed on A; epsilon constraints 
[of the type given in eq. (10)] placed on B. 
CO model: The constraints of eq. (9a) placed on A; epsilon 
constraints of eq. (10) placed on B. 
UC.35 model: Same as UC model but all training sequences with 
VQ distortions greater than 0.35 were eliminated. 
CO.35 model: Same as CO model but all training sequences with 
VQ distortions greater than 0.35 were eliminated. 
For each of the above four models, the 1000 digit sequences of TS1 
were used to measure the overall error rate as a function of the 
$\epsilon$ parameter. The results are given in Fig. 11, which shows plots 
of error rate versus $\epsilon$ (on a log scale) for each of the four 
models. Several trends clearly emerge from these results. First, we can 
see that a nonzero value of $\epsilon$ is an absolute necessity for 
obtaining good performance. Whenever a symbol, k (a VQ index), appears 
in a test word in a state, j, where $b_j(k) = 0$, the probability for 
that word model is multiplied by the $\epsilon$ value. If $\epsilon = 0$ 
then the word model is eliminated from consideration and an error 
occurs. For finite, nonzero values of $\epsilon$, however, such errors 
need not, and generally will not, occur. Hence, even for $\epsilon$ on 
the order of 10~”, the error rate is substantially smaller than for 
$\epsilon = 0$. 

Fig. 11—Plots of average word error rate (in percent) versus the minimum 
value of the symbol probability matrix, $\epsilon$, for four types of 
hidden Markov models, for TS1 data. 

A second clear trend that can be seen in Fig. 11 is that the 
constrained serial models performed consistently better than the 
unconstrained models for all nonzero values of $\epsilon$. This result 
implies that the extra freedom of the unconstrained models tends to 
raise the probability scores for incorrect words more than for correct 
words. This is somewhat reminiscent of the fact that opening up the 
search region of a conventional DTW search helps the wrong words much 
more than it does the correct words. 

It can also be seen from Fig. 11 that it is always preferable to train 
with all available sequences. This result suggests that the more training 
data given to the HMM model estimation algorithm the better the 
estimates of the HMM parameters, even if some of the data are less 
than ideal. 

The final trend that emerges from the curves of Fig. 11 is that there 
is a large range of values of $\epsilon$ for which essentially identical 
performance results. For example, in the range 10°’° < $\epsilon$ $\le$ 10°, 
for model CO, the recognition error rate changes by less than 1.6 
percent. Thus, so long as $\epsilon$ is in this broad range, the exact 
value of $\epsilon$ is not overly significant. 

Based on the results of this first series of experiments, we applied 
the following restrictions: 

(i) Consider only constrained HMMs. 

(ii) Constrain B matrix entries such that $b_j(k) \ge \epsilon$ = 10°. 

(iii) Use all possible training sequences for the HMMs. 

Before we proceed to the next series of experiments, some comments 
should be made about practical methods of implementing constraints 
on the B matrix. In Ref. 13 we show how the constraints on the 
$b_j(k)$ coefficients, of the type given in eq. (10), can be 
incorporated directly into either the gradient or the Baum-Welch 
reestimation algorithm. We have also tried a post-normalization 
technique in which no constraints were placed directly on the B matrix. 
Following convergence, the B matrix was examined and all entries whose 
values were below $\epsilon$ were reset to the value $\epsilon$, and the 
rows of the matrix were suitably renormalized to sum to 1.0. Our 
recognition results indicate identical performance for both the direct 
and the post-normalization constraint methods. Hence, there appears to 
be no advantage to constraining the B matrix directly. 


6.2 Markov model with variable number of states 


The second set of experiments considered the effects on recognition 
accuracy of using the constrained serial HMMs with different numbers 
of states. In particular we computed the optimum SC1 model (see Fig. 
4) for each digit where the number of states varied from 2 to 9. We 
also computed the optimum SC1 model for a 20-state model for each 
digit. 

To evaluate these different models a recognition test was conducted 
in which each digit was represented by a single N-state HMM, where 
N took on the values 2 to 9, and 20. The results of this experiment are 
given in Fig. 12a, which shows word error rate versus number of states 
in the HMM for the data of TS1. We see a steady but slow decrease in 
the average word error rate in this curve. We also see a statistical 
fluctuation in the curve owing to the sampling variability in the A’s 
and B’s for each word HMM. (We will return to this issue later in this 
section.) 

A second recognition test was conducted in which each word was 
represented by all the word models with number of states up to some 
maximum value, NMAX. The results of this recognition test (again 
using TS1 data) are shown in Fig. 12b. A somewhat smoother curve of 
errors versus NMAX is obtained in Fig. 12b than for the individual 
models of Fig. 12a; however, the general behavior of both curves is 
similar. 

Fig. 12—Plots of (a) error rate versus the number of states in the HMM, 
and (b) error rate versus the maximum number of states in the HMMs for 
TS1 data. 

Figure 13 shows a breakdown of the error rates for each digit for the 
experiment in which a single HMM, with N states, was used for each 
digit. There is a highly complex interaction between error rate and 
model size for all digits. Thus it cannot be argued, for instance, that 
digits like zero and seven (two syllables) need more states in their 
models than digits like two or one (monosyllables). 


Fig. 13—Individual plots of digit error rates (in percent) versus N, the 
number of states in the HMMs. 


From the results shown in Figs. 12 and 13, it was concluded that 
there is very little gain in using HMMs with more than five or six 
states when the SC1 structure for each model is being used. It was also 
concluded that no simple relationship existed between word (digit) 
accuracy, number of sounds (syllables, etc.) in the word, and number 
of states needed in the word HMM. 


6.3 Effects of random starting points 


The third set of experiments was concerned with the statistical 
variability in the performance scores resulting from statistical varia- 
bility in the parameters of the word HMMs because of different 
random initial estimates of the parameters. To quantify this effect, a 
5-state SC1 model was generated for each digit using 10 different 
random starting sets of model parameters. Thus, for each digit, 10 
“equivalent” HMMs were created. 

A recognition test was then run, using TS1 data, in which each of 
the 10 models was tested separately. Also tested was the case in which 
all 10 models were used for each digit, as well as the case in which a 
single model was used for each digit, where the model parameters were 
obtained by averaging the parameter estimates for each of the 10 word 
models. The results of these recognition tests are given in Fig. 14, 
which shows word error rate versus the random start number. Also 
shown, as single isolated values, are the error scores for the averaged 
model and for the combined 10-start run. The dashed line in Fig. 14 is 
the average error rate of the 10 individual models. 

From the data of Fig. 14, we can see that the 10 individual models 
all performed identically to within ±1 percent; hence, the expected 
statistical variability in error rate scores, due to random starts, should 
be on the order of ±1 percent. We can also see that the performance 
of the “averaged” model (obtained by averaging HMM parameters for 
each word) was somewhat poorer than that of any of the 10 individual 
models. The fact that using all 10 word models for each digit gives an 
error rate comparable to that of the best word models indicates that 
multiple word models provide a small gain that comes at the cost of 
greatly increased computation. 

Fig. 14—Plots of error rate (in percent) for 10 different random starting 
sets of values of the HMM parameters. Also shown are individual error 
rates for an averaged model and for combining 10 random start models. 

Overall, the data of Fig. 14 suggest that a single model per word 
should be adequate for most purposes, and that the effects of different 
random starts on the overall error performance are small. 


6.4 Parallel constrained HMMs 


The last factor investigated was the effect of using constrained 
parallel HMMs for each digit. The main idea was that a true parallel 
structure could model the effects of using a multiplicity of word models 
in much the same way as multiple templates are used in the conven- 
tional LPC/DTW word recognizer. 

Figure 15a shows the 5-state constrained serial HMM that was used 
in previous experiments along with a 7-state, constrained parallel 
HMM (Fig. 15b), and an 8-state, constrained parallel model (Fig. 15c). 
The 7-state parallel structure was intended to represent four distinct 
5-state word models, in that there were four sets of paths through the 
model, namely, 1-2-5-7, 1-2-6-7, 1-3-5-7, and 1-3-6-7. The 8-state parallel 
structure was intended to represent eight distinct 5-state word models 
in that there were eight sets of paths through this model, namely, 
1-2-4-6-8, 1-2-4-7-8, 1-2-5-6-8, 1-2-5-7-8, 1-3-4-6-8, 1-3-4-7-8, 1-3-5-6-8, 
and 1-3-5-7-8. 

Each of the three HMM structures of Fig. 15 was used to generate 
a word model for each of the digits. The three sets of models were 
then tested using TS1 data. The results showed each of the three 
systems obtained the same word error rate (3.5 percent) to within ±0.1 
percent. These results indicated that there was really no advantage in 
using the parallel structure. 


6.5 Comparison with LPC/DTW recognizers 


To provide some basis of comparison for the performance of the 
HMM/VQ recognizer with that of more conventional word recognizers, 
the data of TS1 was tested on the LPC/DTW recognizer of Fig. 1. The 
reference set consisted of 12 templates per digit, generated from a 
clustering analysis of the 100 tokens of each digit in the training set 
(the same training set used to train each word HMM). The decision 
rule was the nearest neighbor rule (KNN = 1). 

Fig. 15—State diagrams for (a) simple constrained serial 5-state model, 
(b) constrained parallel 7-state model, and (c) constrained parallel 
8-state model. 


Tables I and II show average word recognition accuracies for the 
following three recognizers: 

(i) HMM/VQ using a constrained serial structure with five states 
for the HMM and the 64-element VQ. 

(ii) LPC/DTW, the conventional recognizer. 

(iii) LPC/DTW/VQ, the conventional recognizer with both reference 
and test patterns quantized using the same VQ used in the HMM 
case. 

The results shown in Table I are for the 1000 digits of TS1, whereas 
those shown in Table II are for the 2000 digits (10 talkers) of TS2. The 
results given in Table I show that for TS1, both the LPC/DTW and 
HMM recognizers, when using the VQ, achieved essentially the same 
digit accuracy; however, the LPC/DTW system without the VQ 
achieved a 2-percent higher word accuracy. For TS2 in Table II the 
results show that the VQ led to a 3.2-percent reduction in accuracy for 
the LPC/DTW system, and an additional 2.7-percent loss in accuracy 
for the HMM system. A good deal of the loss in accuracy, however, 
was contributed by talker 1, whose accuracy was 87 percent for the 
LPC/DTW/VQ recognizer, and 74.5 percent for the HMM/VQ 
recognizer. With only 10 talkers, the influence of a single talker on the 
overall accuracy may be substantial. 


Table I—Comparison of results on HMM/VQ and LPC/DTW word 
recognizers [average word accuracy (%)] for TS1, 100 talkers, 
10 digits per talker 

                        Recognizer 
Digit       HMM/VQ     LPC/DTW    LPC/DTW/VQ 
0           98         99         99 
1           98         98         99 
2           96         100        96 
3           99         99         97 
4           93         97         96 
5           97         96         93 
6           96         100        94 
7           99         100        94 
8           92         98         96 
9           95         98         96 
Average     96.3       98.5       96.5 


Table II—Comparison of results on HMM/VQ and LPC/DTW word 
recognizers [average word accuracy (%)] for TS2, 10 talkers, 
200 digits per talker 

                        Recognizer 
Talker      HMM/VQ     LPC/DTW    LPC/DTW/VQ 
1           74.5       96         87 
2           99.5       100        99.5 
3           94         99         97 
4           89         99         91 
5           95.5       100        100 
6           95.5       99         99.5 
7           100        100        99.5 
8           91.5       96         93 
9           91.5       99.5       93.5 
10          96.5       98.5       98 
Average     92.8       98.7       95.5 

An analysis of the actual errors of all three of the recognizers of 
Tables I and II shows the following: 

(i) Of the 37 tokens misclassified by recognizer HMM/VQ in TS1, 
31 were correctly identified by the LPC/DTW recognizer, and 27 were 
correctly identified by the LPC/DTW/VQ recognizer. 

(ii) The vast majority (25) of the 37 errors made by recognizer 
HMM/VQ were cases in which the probability of the correct word was 
lower than the probability of the incorrect word by a factor of e® or 
larger. 

The results show that when the LPC/DTW recognizer (either with or 
without the VQ) has incorrectly identified the word, most of the time 
the HMM recognizer has correctly identified the word. Hence, a side 
result of this work is that the HMM/VQ model can be combined with 
an LPC/DTW model to provide word accuracies greater than either 
individual recognizer could obtain. In particular, for the data of TS1, 
the combined recognizer could have achieved a 99.4-percent word 
accuracy by using appropriate decision logic on all cases in which 
both recognizers did not agree. 

The same sort of trend is noted in the errors of TS2. Of the 124 
errors made by the HMM/VQ recognizer, 113 were correct in the 
LPC/DTW recognizer; equivalently, of the 26 errors made by the LPC/DTW 
recognizer, 15 were correct in the HMM/VQ system. Hence, again a 
combined system could potentially achieve an accuracy of about 99.5 
percent on TS2 data. Considering that the talkers in the TS2 
data were different from those in the training set, this accuracy appears 
to be quite remarkable. 
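
The combined-recognizer figures follow from counting the tokens that 
both recognizers misclassify. The Python sketch below reproduces that 
arithmetic from the error counts quoted in the text. It assumes an 
ideal decision logic that always selects the correct recognizer 
whenever the two disagree, so the result is an upper bound; the 
function name `combined_accuracy` is hypothetical. 

```python
def combined_accuracy(total_tokens, errors_a, errors_a_correct_in_b):
    """Upper bound on accuracy (%) when recognizers A and B are combined.

    A token is lost only if both recognizers misclassify it, i.e. it is
    one of A's errors that B did not identify correctly.
    """
    both_wrong = errors_a - errors_a_correct_in_b
    return 100.0 * (total_tokens - both_wrong) / total_tokens


# TS1: 37 HMM/VQ errors, 31 of them correct in LPC/DTW (6 shared errors).
ts1_bound = combined_accuracy(1000, 37, 31)

# TS2, counted from either side: 124 - 113 = 26 - 15 = 11 shared errors.
ts2_bound_from_hmm = combined_accuracy(2000, 124, 113)
ts2_bound_from_dtw = combined_accuracy(2000, 26, 15)
```

For TS1 the bound is 99.4 percent, and for TS2 both ways of counting 
give 11 shared errors, or 99.45 percent, consistent with the figure of 
about 99.5 percent. 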

In summary, comparisons between the HMM/VQ and LPC/DTW 
recognizers indicate that without the VQ, the LPC/DTW recognizer 
achieves from 2- to 6-percent higher accuracy than the HMM/VQ 
system; with the VQ the differential in accuracy is from 0 to 3 percent. 


VII. DISCUSSION 


In this paper we showed that the techniques of vector quantization 
of LPC vectors and hidden Markov modeling can be combined in a 
simple, straightforward manner to implement a speaker-independent, 
isolated word recognizer. With adequate training of the vector quan- 
tizer and the Markov model estimation algorithm, a digits vocabulary 
can be recognized, with accuracies of from 93 to 96 percent across a 
wide variety of talkers. Direct comparisons with a conventional linear 
predictive coding recognizer using dynamic time warping for time 
alignment with multiple templates for each vocabulary word showed 
that the HMM/VQ recognizer performs only a little worse (0.2 percent 
in one test, 2.7 percent in another) than the LPC/DTW recognizer 
when using the VQ. Without the VQ, the LPC/DTW recognizer was 
about 2 to 3 percent better than when the VQ was used. 

Several general conclusions can be drawn from the results. The first 
is that the HMM/VQ recognizer performed exceedingly well on the 
difficult task of speaker-independent recognition of isolated digits. The 
fact that the overall performance of the HMM/VQ recognizer was 
somewhat poorer than the LPC/DTW/VQ recognizer appears to be 
primarily because of the insufficiency of the HMM training data. 
Although 100 training sequences per word are adequate for a clustering 
analysis, as used in the conventional recognizer, they appear to be 
inadequate for obtaining good HMM models for these words. This 
suggestion is made plausible by considering what is being estimated in 
the HMM for each word. For an N-state Markov model, with M finite 
symbols per state, a total of $N^2 + NM$ parameters must be estimated. 
(Of course with constraints there are somewhat fewer parameters.) 
For N = 5, M = 64, we need 345 parameters to be estimated from 
about 100 × 40 frames of VQ indices. The “curse of dimensionality” 
would imply that this amount of training data is woefully inadequate. 
[We have seen one consequence of this inadequacy in that we had to 
use the $\epsilon$-constraints of eq. (10) on any $b_j(k)$ whose value 
fell below the $\epsilon$ threshold.] In view of this, the fact that we 
achieve the results we are getting is rather remarkable. 
A second conclusion that can be drawn from the results is that the 
use of the VQ on the LPC sets leads to a small, but not insignificant, 
degradation in performance of both the HMM/VQ and LPC/DTW/VQ 
recognizers as compared to the conventional LPC/DTW system. 
This suggests the need for using more than 64 vectors in the codebook 
or resorting to continuous models of the LPC parameters. 

The results have shown that the errors made by the HMM/VQ and 
LPC/DTW recognizers are largely disjoint. Here there exists the 
potential of using some fairly standard techniques to combine the two 
recognizers into one whose accuracy is as good as the best of both 
recognizers on any given word. This topic merits further 
consideration. 

The experimentation with various forms of the Markov models used 
in the recognizer showed fairly conclusively that: 

(i) Constrained models (with constrained transition matrices) 
performed consistently better than unconstrained models. 

(ii) A finite minimum constraint on the state symbol probability 
matrix was a necessity for good system performance. 

(iii) The effects of different random starting values for the HMM 
parameters were negligible in evaluating overall recognizer 
performance. 

(iv) The required number of states in each word HMM needed to 
be on the order of 5. More states did not lead to significant 
improvements in performance. 

(v) Parallel HMM structures yielded no real improvements over 
cascade structures, thereby indicating that an equivalent of multiple 
HMMs is not readily obtainable by simply changing the model 
structure. 

(vi) The Viterbi scoring and the Baum-Welch scoring of test 
sequences give essentially identical performance. 


7.1 Computational considerations of the HMM/VQ recognizer 


It is worthwhile estimating the storage required and the computation 
needed to process an unknown test utterance using the HMM/VQ 
recognizer, and to compare them to the requirements of the conven- 
tional LPC/DTW recognizer. Our intention is to provide only a rough 
estimate of computational expense. A number of straightforward re- 
ductions in computation can be achieved for each recognizer through 
the judicious use of table storage. Also we ignore overhead owing to 
index computation, etc. 

The first extra step in the HMM/VQ recognizer (after conventional 
LPC analysis) is vector quantization of the unknown test pattern. If 
we assume there are T frames in the test word, and M codebook entries 
in the VQ, then we need a total of 


C, = M-T(p + 1) (17) 


multiplications* (where p = LPC order) to perform the M·T dot
products required to get the best codebook entry for each frame.
Evaluation of the word HMM score, using the Viterbi scoring
method with the constrained A matrix, requires approximately


$C_2 = T \cdot N \cdot 3$ (18)


multiplications and logarithms per word model, where N is the number 
of states in the model, and the factor 3 accounts for the number of 
valid transitions into a given state. For a vocabulary of V words a total 
of 


$C_\times = M \cdot T \cdot (p + 1) + V \cdot T \cdot N \cdot 3$ (19a)

$C_{\log} = V \cdot T \cdot N \cdot 3$ (19b)

multiplications and logarithms are required. For M = 64, T = 40, V =
10, N = 5, and p = 8, eq. (19) gives $C_\times$ = 29040 multiplies and $C_{\log}$ =
6000 logarithms.


For a conventional LPC/DTW recognizer with Q templates per
vocabulary word, the computation for DTW processing is 


$C_3 = Q \cdot V \cdot (T^2/3) \cdot (p + 1)$, (20)


which for Q = 12 and other parameters the same as above gives C3 = 
576000 multiplications. Hence, the HMM/VQ recognizer requires 
about 17 times less computation (assuming logarithms are equivalent 
to multiplications) than the LPC/DTW recognizer. 
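
As a quick check of the arithmetic in eqs. (17) through (20), the counts can be evaluated directly. The sketch below is illustrative only: the function names are ours, and a logarithm is counted as one multiplication, as assumed in the text.

```python
def hmm_vq_counts(M, T, V, N, p):
    """Return (multiplications, logarithms) per eqs. (19a) and (19b)."""
    vq = M * T * (p + 1)      # eq. (17): M*T dot products, each of length p + 1
    scoring = V * T * N * 3   # eq. (18) for each of the V word models
    return vq + scoring, scoring


def lpc_dtw_count(Q, V, T, p):
    """Return the DTW multiplication count of eq. (20)."""
    return Q * V * T**2 * (p + 1) / 3


mults, logs = hmm_vq_counts(M=64, T=40, V=10, N=5, p=8)
dtw = lpc_dtw_count(Q=12, V=10, T=40, p=8)
print(mults, logs, dtw, round(dtw / (mults + logs), 1))
```

For the parameter values in the text this gives 29040 multiplications and 6000 logarithms against 576000 multiplications, a ratio near 16.4, close to the factor of about 17 quoted above.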

With regard to storage, the HMM/VQ recognizer requires 


*In this simplified analysis we neglect additions and comparisons of data and use 
multiplication count as the measure of computation. 
$S_1 = M (p + 1)$ (20a)
$S_2 = (M \cdot N + 3N) \cdot V$, (20b)


where $S_1$ is the storage (in words*) for the VQ codebook entries, and
$S_2$ is the storage (in words) for the set of V word HMMs. For the given
values of the parameters, the total storage is S = $S_1$ + $S_2$ = 4106 words.
For the conventional LPC/DTW recognizer the storage is 


$S_3 = Q \cdot V \cdot T \cdot (p + 1)$, (20c)


which gives 43,200 words for the assumed parameter values. Again we 
see a 10 to 1 reduction for the HMM/VQ model over the LPC/DTW 
model. 
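
The storage expressions can be evaluated in the same way. This is a minimal sketch with function names of our own choosing. Note that the formulas as printed here give $S_1$ + $S_2$ = 3926 words for the stated parameters rather than the quoted 4106; the source of the difference is not evident from this section, and either figure supports the roughly 10-to-1 ratio.

```python
def hmm_vq_storage(M, V, N, p):
    """Return (S1, S2) in words, per eqs. (20a) and (20b) as printed."""
    s1 = M * (p + 1)            # VQ codebook: M entries of p + 1 values
    s2 = (M * N + 3 * N) * V    # V word HMMs
    return s1, s2


def lpc_dtw_storage(Q, V, T, p):
    """Return S3 in words, per eq. (20c): Q templates of T frames per word."""
    return Q * V * T * (p + 1)


s1, s2 = hmm_vq_storage(M=64, V=10, N=5, p=8)
s3 = lpc_dtw_storage(Q=12, V=10, T=40, p=8)
print(s1, s2, s1 + s2, s3)
```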

It should be noted that in our analysis of computation we have not 
included the computation of LPC coefficients. This computation must 
be performed in “real-time” and is independent of the vocabulary size, 
V. Hence, for a sufficiently large vocabulary the computation for 
scoring each word dominates the overall computation, and the rough 
analysis given above is appropriate. Furthermore, the computation for 
coding each LPC vector into the nearest codebook entry [eq. (17)] is 
also independent of vocabulary size and often could be performed in 
the “real-time” part of the recognizer. For such implementations the 
gain in computation of the HMM/VQ recognizer, over the conventional 
LPC/DTW recognizer, is even higher than our simple analysis predicts. 
Finally, it is straightforward to show that if we compare the computation
of the HMM/VQ recognizer with that of the LPC/DTW/VQ
recognizer, assuming that the VQ is done in real time and that tabular 
computation of products and logarithms is used, then by comparing 
the number of additions, the computational advantage of the HMM/ 
VQ system over the LPC/DTW/VQ system still holds. 


7.2 Some comments on the relationships between DTW and HMMs 


Contemporary research on speech recognition has produced two 
algorithmic procedures for dealing with the nonstationarity of the 
speech signal: temporal alignment techniques, and Markov modeling. 
These methods display certain superficial similarities (e.g., both use 
dynamic programming methods, can be cast in a Bayesian framework, 
and have a state transition network associated with them), as a result 
of which it has occasionally been claimed that they are identical. To 
the best of our knowledge, the experiments reported here represent 
the first direct comparison of the two methods. From these experi- 
ments it is abundantly clear that the methods are not identical. While 


* Words of storage refer to unquantized floating point data. 
their overall performances are comparable, they appear to make different
errors and involve different amounts of computation and different
complexities of training.

In all of these respects the two methods reflect the dichotomy 
between parametric and nonparametric methods of pattern recognition 
in the sense of Patrick.” In Markov modeling we assume that there is 
a family of models of a particular structure differing only in the values 
of their parameters. We use a large training set to estimate these 
parameters and assume that, if correctly done, the parameters will 
capture the structure of the data. The training procedure is computa- 
tionally expensive but need be done only once. After training is 
complete, relatively little computation is required to determine 
whether an unknown observation was generated by the model. 
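
To make the low cost of scoring concrete, the following is a minimal sketch of Viterbi scoring of a vector-quantized observation sequence against a discrete HMM, carried out in the log domain. The function and argument names are our own, and the sketch takes the best path ending in any state, which may differ in detail from the scoring used in the experiments. Forbidden transitions of a constrained model are represented by log probability of minus infinity.

```python
import math


def log0(x):
    """Logarithm that maps a zero probability to minus infinity."""
    return math.log(x) if x > 0 else -math.inf


def viterbi_log_score(obs, log_pi, log_a, log_b):
    """Best-path log probability of the symbol sequence obs.

    obs    : codebook indices, one per frame
    log_pi : log initial-state probabilities (length N)
    log_a  : N x N log transition probabilities
    log_b  : N x M log symbol probabilities for each state
    """
    n = len(log_pi)
    delta = [log_pi[j] + log_b[j][obs[0]] for j in range(n)]
    for symbol in obs[1:]:
        delta = [
            max(delta[i] + log_a[i][j] for i in range(n)) + log_b[j][symbol]
            for j in range(n)
        ]
    return max(delta)


def recognize(obs, word_models):
    """word_models maps word -> (log_pi, log_a, log_b); return the best word."""
    return max(word_models, key=lambda w: viterbi_log_score(obs, *word_models[w]))
```

With a left-to-right model in which each state can be entered from at most three states, only three terms of the inner maximization are finite, which is the origin of the factor 3 in eq. (18).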

Temporal alignment methods are opposite in the following ways. 
We assume that there is an underlying structure to the training data 
but its form is unknown. We attempt to capture that structure by 
simply storing one or more samples and measuring their “distance” to 
an unknown sample with a metric that is sensitive to the distinctive 
features of the categories that we seek to identify. The metric is 
monotonically related to the class conditional density functions so that 
minimum distance corresponds to maximum model probability. In this 
case training is a computationally simple data collection and storage 
process. Probability computation, on the other hand, is very costly 
since we must measure the distance to every prototype in the training 
set. 
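
By way of contrast, a temporal alignment recognizer can be sketched as follows. This is an illustrative simplification, not the DTW algorithm used in the experiments: it uses the simplest local path constraints (horizontal, vertical, and diagonal steps), no global path constraints, and a user-supplied frame distance in place of an LPC distance measure. All names are hypothetical.

```python
def dtw_distance(test, template, dist):
    """Accumulated distance along the best warping path between two patterns."""
    inf = float("inf")
    n_test, n_ref = len(test), len(template)
    acc = [[inf] * (n_ref + 1) for _ in range(n_test + 1)]
    acc[0][0] = 0.0
    for i in range(1, n_test + 1):
        for j in range(1, n_ref + 1):
            cost = dist(test[i - 1], template[j - 1])
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n_test][n_ref]


def nearest_template(test, templates, dist):
    """templates is a list of (word, pattern); return the word of the closest one."""
    return min(templates, key=lambda wt: dtw_distance(test, wt[1], dist))[0]
```

Training here amounts to filling the `templates` list, while classification requires one full alignment per stored template, which is why the cost grows with the number of templates per word.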

All of these characteristics are made manifest by our experiments. 
What remains to be determined is whether the parity of these methods 
extends to more difficult problems of speech recognition. We hope to 
answer that question by further experimentation. 


VIII. SUMMARY


We have described the results of an extensive investigation into the 
applicability of the techniques of vector quantization and hidden 
Markov modeling to speaker-independent, isolated word recognition. 
We have shown that, when properly designed, the resulting recognition 
system produces highly accurate word recognition on a vocabulary of 
isolated digits. We have also discussed the effects of variations of 
model parameters on system performance. Our experiments show that 
the resulting recognizer requires about 10 times less storage, and about 
17 times less computation for classifying a test utterance than does an 
equivalent recognizer using LPC coding and dynamic time warping. 
These economies are obtained at the expense of only a slight increase 
in error rate. 


A Lithographic Mask System for MOS Fine-Line
Process Development 


By J. M. ANDREWS 
(Manuscript received September 23, 1982) 


A mask set, incorporating a group of seven test chips, has been 
designed for fine-line process development and process control. Al- 
though six lithographic levels are available, the masks are generally 
intended to be used only in subsets of two or three levels to minimize 
the delay encountered in obtaining electrical test results for whichever 
processes require investigation. The mask levels serve a variety of 
purposes for special process development experiments. Available 
structures include: metal-oxide-semiconductor capacitors, p-n junc- 
tions, guarded and unguarded Schottky barrier diodes, ohmic con- 
tacts, van der Pauw patterns, insulated gate field-effect transistors, 
gated diodes, resistors for sheet resistance and linewidth variations, 
and tapped electromigration test strings. It is not anticipated that a 
process engineer should ever need more than a maximum of four 
levels to achieve an appropriate experimental structure for process 
development. It is not the purpose of these masks to establish fine- 
line design rules. The masks are intended to be used primarily with 
standard photolithographic processing, and most device structures 
have been designed to tolerate up to 5 µm in misalignment errors.
However, certain selected features have been coded in a diminishing
sequence to a minimum of 1.0 µm for special fine-line investigations.
A salient feature of this mask system is the option to interleave rapid 
turnaround photolithographic steps with fine-line X-ray patterning; 
therefore, some mask levels have been reissued for X-ray lithography. 


I. INTRODUCTION


In the past, the development of new silicon integrated circuit proc- 
esses was impeded by the fact that an adequate set of simple test 
structures usually could not be fabricated without resorting to the full 
set of six lithographic levels required by the Poon Tester Chips.’ This 
set sometimes requires several months to fabricate if X-ray lithography 
is used. If device wafers could not be sacrificed, the processing engineer
had to resort to simulating device structures, either by using metal 
dots of fixed areas on unpatterned oxides or deposited films, or by 
other schemes such as the use of offset circular windows using a pair 
of photolithographic steps.”* 

A set of photolithographic masks has been designed and is now 
available to fill the process development gap. The goal has been to 
provide the processing engineer with the means to simulate critical 
processing steps by introducing a monitor wafer, prepared in advance 
by one photolithographic step, and usually requiring only one more 
lithographic step of any type to obtain a structure ready for electrical 
testing. 

The full set of fine-line process development masks consists of six 
photolithographic levels, but these have been designed to be utilized 
in subsets of 2, 3, or 4 levels only. Available structures include metal- 
oxide-semiconductor (MOS)* capacitors, contact windows, guarded 
and unguarded Schottky diodes, van der Pauw patterns, insulated gate 
field-effect transistors (IGFETs), gated diodes, and tapped electro- 
migration test strings. Also available are large areas accessible for 
direct probing, for evaporation of MOS dots, or for Auger, scanning 
electron microscope (SEM), and transmission electron microscope 
(TEM) studies. 

Section II contains a complete description of each of the devices 
that are available in the MOS mask set. The organization of the 
devices among each of seven test chips, the chip designations, the 
device assignments, wafer layout, design rules, and the alignment 
features are discussed in Section III. Section IV consists of a detailed 
description of each mask level separately, including the primary pur- 
pose for which the level is intended and the features available. Section 
IV also includes specific recommendations regarding which levels can 
be omitted with respect to the particular devices required or the 
experimental intent. Specific applications to silicon integrated circuit 
processing are discussed in Section V, and the MOS mask system is 
summarized in Section VI. 


II. DEVICE DESCRIPTION


Most of the test structures in the lithographic mask system are MOS 
capacitors, because such two-terminal devices are easily fabricated. 
Normally, only two photolithographic operations are needed, and the 
first could usually be done on a large number of wafers before specific 
experiments are planned. Furthermore, the same patterns can be used 


* Acronyms and abbreviations are defined fully in the Glossary at the end of this 
paper. Tables and figures are located at the end of the text. 
to fabricate p-n junctions, Schottky diodes, or ohmic contacts to the
substrate by simply omitting the gate oxidation. All such devices, with
dimensions ranging from 1 to 500 µm, are contained on a single chip
designated A. Larger devices, with dimensions ranging from 1000 to
4000 µm, are included on chips B and C1 through C4. The aim has
been to provide maximum experimental flexibility by including a wide 
range of available device dimensions, which permits electrically active 
areas to span more than seven orders of magnitude. A summary of all 
devices contained in the MOS mask system is presented in Table I in 
the form of a device key, listing MOS device, nominal dimension, the 
chip assignment, and the appropriate probing pad number. Detailed 
descriptions of each device follow in Subsections 2.1 to 2.5. Dimen- 
sional data on all device structures are contained in the pad keys 
shown in Tables II through V. 


2.1 Sixfold MOS capacitor group (HEXCAP) 


The design of the lithographic mask system has evolved from a 
sixfold set of MOS capacitors (HEXCAP) that can be implemented at 
almost any stage of device processing to provide electrical characteri- 
zation of dielectric layers. 


2.1.1 FOXCAP and GOXCAP 


The field oxide capacitor (FOXCAP) and gate oxide capacitor (GOXCAP)
are shown in Fig. 1. For this pair of capacitors, the dimension $L_1$
increases through the sequence 50, 100, 200, 500, 1000, 2000, and 4000
µm. In all cases the dimension $L_2$ = $L_1$ + 10 µm, to conform with
relaxed design rules to minimize registration errors (see Section 3.3).
All FOXCAP and GOXCAP devices with $L_1$ ≤ 500 µm are contained
on the A chip, with peripheral probing pad locations coinciding with
the Poon Tester’ and the Process Monitor* chips. To facilitate MOS
characterization of thick oxides or deposited dielectrics, a limited
number of HEXCAP devices with $L_1$ equal to 1000 and 2000 µm were
assigned to a larger chip B with area equal to that of ten standard
Poon chips. The largest FOXCAP and GOXCAP devices on the C1
chips (see Section 3.1.3) are not square. Rectangular GOXCAP dimensions
$L_3$ and $L_4$ for the largest devices have been selected so that


$L_3 L_4 = (4000\ \mu\mathrm{m})^2$. (1)


The rectangular structures of FOXCAP and GOXCAP are shown in
Fig. 16. Specific values for $L_3$ and $L_4$ were selected so that the overall
C1 chip dimensions could be adjusted to accommodate the contact
metallization test chip (METEST) (see Sections 3.1.4 and 3.2).

In each HEXCAP group, the portions of the polysilicon (POLY) 
areas that overlay the gate and source and drain (GASAD) areas are 
equal. Thus, the capacitors formed over gate oxide ($C_{GOX}$) are equal
among all elements of each HEXCAP group except FOXCAP. Furthermore,
the total POLY areas are equal for FOXCAP, GOXCAP,
and the devices with comb-shaped electrodes (POLYCOMB) to facilitate
use with automatic testing programs. Thus, constant parasitic
capacitance contributed by the field oxide ($C_{FOX}$) in parallel with $C_{GOX}$
has been maintained among GOXCAP, GOXCOMB, and POLY- 
COMB (see following Section 2.1.2) for convenience in software devel- 
opment. Detailed dimensional data are contained in the pad keys, 
Tables II, III, and IV. 

It should be clear from Fig. 1 that omission of the gate oxide in 
GOXCAP results in a structure suitable for ohmic contact, Schottky 
barrier diode, and p-n junction experiments. For this reason the 
dimension $L_1$ also decreases through the sequence 20, 15, 10, 7, 5, 4, 3,
2, 1.5, and 1.0 µm on the A chip. It is not expected, however, that the
smallest GOXCAP windows would be routinely resolved by standard 
photolithographic processes. To resolve the smallest features, the 
GASAD mask level has been reissued for X-ray lithography. Because 
the X-ray target is nearly a point source, the finite separation between 
X-ray mask and the silicon substrate results in a slight magnification 
of the X-ray image. To compensate for the magnified X-ray image, a 
small demagnification of all features on the X-ray mask was required 
to make the X-ray lithographic level compatible with previous or 
subsequent optical lithographic levels. Alignment in the X-ray expo- 
sure facility, however, is done optically. Therefore the spacing between 
alignment features on the X-ray mask must remain the same as on the 
optical mask. With these modifications, X-ray lithography can be 
interleaved with more easily performed photolithographic steps. 
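
The size of the required demagnification follows from simple projection geometry. The sketch below assumes an ideal point source at a distance D above the mask and a mask-to-wafer gap g, so that a mask feature is imaged onto the wafer enlarged by (D + g)/D. The function names and the numerical values are hypothetical and are not taken from the exposure facility described here.

```python
def xray_magnification(source_to_mask, gap):
    """Geometric magnification of the mask image on the wafer (point source)."""
    return (source_to_mask + gap) / source_to_mask


def mask_dimension(wafer_dimension, source_to_mask, gap):
    """Mask dimension that images to the desired wafer dimension."""
    return wafer_dimension / xray_magnification(source_to_mask, gap)


# Hypothetical geometry: 500-mm source distance, 0.05-mm gap.
m = xray_magnification(500.0, 0.05)
print(m, mask_dimension(4000.0, 500.0, 0.05))
```

For a fixed geometry the correction is a uniform scale factor applied to the device features, which is why the optically used alignment features can be left at their original spacing.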

The use of the GOXCAP structure to form a guarded Schottky 
diode is illustrated in Fig. 2, in which a guard ring with width W is 
diffused into the substrate before GASAD lithography. Guard-ring 
width options available in the mask set decrease through the sequence 
10, 7, 5, 3, 2, and 0 µm (see Section 4.1). In all cases the guard ring has
been centered on the GASAD boundary so that half its width extends 
into the contacting metal and reduces the effective contact area (see 
Fig. 2), so that: 


$A_{\mathrm{eff}} = (L_1 - W)^2$ (µm²). (2)


For the cases in which W = $L_1$, the structure simply reduces to a p-n
junction. Square or rectangular features have also been included in 
the GUARDRING level under the FOXCAP structures to provide the 
option of fabricating buried channel MOS capacitors”® or to investigate 
the profiles of various ion implantations, such as those used to control 
threshold or punchthrough or for depletion loads. 
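
Equation (2) is easily tabulated for the available guard-ring widths. The following is a minimal sketch; the function name is ours, and clamping the area to zero when the guard ring is at least as wide as the window is our reading of the limiting case in which only a p-n junction remains.

```python
def effective_contact_area(l1_um, guard_width_um):
    """Eq. (2): metal-to-substrate contact area in square micrometers."""
    side = max(l1_um - guard_width_um, 0.0)  # half of W intrudes on each edge
    return side ** 2


for w in (10, 7, 5, 3, 2, 0):
    print(w, effective_contact_area(20, w))
```

For a 20-µm window, for example, the 10-µm guard ring reduces the contact area from 400 to 100 µm².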


2.1.2 GOXCOMB and POLYCOMB


To study peripheral effects or defects, structures with expanded 
gate-oxide/field-oxide periphery were included as the third HEXCAP 
element. This structure, abbreviated GOXCOMB, is shown in Fig. 3. 
On the A chip, the GOXCOMB structures comprise rectangular elements
of width $d_1$ = 25 µm, which are spaced $d_2$ = 10 µm apart. The
total gate oxide areas of the two GOXCOMB structures on the A chip
are (200 µm)² and (500 µm)². To further enhance peripheral effects,
GOXCOMB structures have been included on the B and C2 chips with
$d_1$ = 5 µm and $d_2$ = 10 µm. Because of the decreased filling factor
associated with reducing $d_1$ to 5 µm, it was not practical to keep the
total gate oxide areas equal to the areas of the associated 1000-,
2000-, and 4000-µm GOXCAP structures; the actual areas are listed in
Tables III and IV. 

The POLYCOMB structure is similar to GOXCOMB except that 
the increased perimeter or peripheral expansion occurs at the gate- 
oxide/polysilicon boundary. The structural detail of POLYCOMB, the 
fourth HEXCAP element, is shown in Fig. 4. The chip assignments for 
POLYCOMB are the same as for GOXCOMB except for the 4000-µm
structure, which is on the C3 chip. The same values for $d_1$ and $d_2$ apply
to both COMB structures. 


2.1.3 OVLAP and NOVLAP with FIELD PLATE 


In some cases it is desirable to minimize parasitic capacitance in an 
MOS structure, i.e., the parallel capacitance composed of the area in 
which the gate electrode overlaps field oxide. The last pair of HEXCAP 
devices has been designed to minimize parasitic capacitance, consistent 
with the design rules discussed in Section 3.3. The OVLAP capacitor, 
shown on the left in Fig. 5, has a 5-µm overlap of the gate electrode
onto the surrounding field oxide. The NOVLAP capacitor, shown on
the right, has a gate electrode that has been retracted 5 µm from the
GASAD boundary. The portion of the gate electrode that covers gate 
oxide in the OVLAP capacitor is equal to the area of the gate electrode 
in the NOVLAP capacitor. Both OVLAP and NOVLAP must be 
probed directly, because they are completely surrounded by a field 
plate that can be used to control the surface potential near the edges 
of each capacitor. 

If a metallic silicide is formed in place of gate oxide, the resulting 
structure allows investigation of the effects of overlying metallization 
when excessive metallic penetration at contact window edges is sus- 
pected.? For these investigations it may be useful to include the 
GUARDRING option prior to silicide formation. 


2.2 Sheet resistance group (SADSHEET and POLYSHEET)


Two three-terminal structures on the A chip can be used to obtain 
sheet resistance data from lines 400 µm long. Thus, accurate sheet
resistance measurements can be made, even though the linewidths 
may deviate from the coded values because of unknown degrees of 
overetching or other process variations. 

The left side of Fig. 6 shows the structure for measuring polysilicon
sheet resistance (POLYSHEET). It consists of two 400-µm lines in
series, but the coded linewidths are different: $W_1$ = 5 µm and $W_2$ = 8
µm. After processing, the actual linewidths may differ from the coded
linewidths by a constant amount, ε. Assume that a positive value for
ε corresponds to a linewidth loss from the coded value, $W_i$. If the
resistance of each line is measured, it is possible to solve for both the
sheet resistance and a constant linewidth loss, ε. It can be shown that


$R_S = \frac{R_1 R_2}{R_1 - R_2} \cdot \frac{W_2 - W_1}{L} = 7.5 \times 10^{-3}\, \frac{R_1 R_2}{R_1 - R_2}$ (Ω/□) (3)


and


$\epsilon = \frac{R_1 W_1 - R_2 W_2}{R_1 - R_2} = \frac{5 R_1 - 8 R_2}{R_1 - R_2}$ (µm). (4)


A GUARDRING structure is included beneath some of the POLYSHEET
lines. If the GUARDRING option were used, for example, to
etch channels of various widths in a field oxide, the resulting POLYSHEET
structure would provide information on poly-Si linewidths
within oxide channels or straddling oxide steps.

Source-and-drain sheet resistance and linewidth variations can be
determined from measurements on the structure shown on the right
side of Fig. 6. The coded dimensions of the SADSHEET lines are
exactly the same as for POLYSHEET, and eqs. (3) and (4) apply.

When the GUARDRING level precedes GASAD, some of the SADSHEET
lines are imbedded into the guard-ring diffusion. Such a
structure could be useful in determining the sheet resistance of a
metallic silicide line for cases in which a low Schottky barrier height
between the silicide and the substrate would interfere with electrical
measurements.
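
Equations (3) and (4) follow from modeling each line as a resistor of length L and actual width $W_i$ − ε, so that $R_i$ = $R_S$ L/($W_i$ − ε). A minimal sketch of the data reduction is given below; the function name and default arguments are ours, with the defaults set to the coded POLYSHEET and SADSHEET dimensions.

```python
def sheet_resistance_and_loss(r1, r2, w1=5.0, w2=8.0, length=400.0):
    """Solve eqs. (3) and (4) from the two measured line resistances (ohms).

    Returns (sheet resistance in ohms per square, linewidth loss in micrometers).
    r1 belongs to the line of coded width w1, r2 to the line of coded width w2.
    """
    if r1 == r2:
        raise ValueError("the two line resistances must differ")
    rs = (r1 * r2 / (r1 - r2)) * (w2 - w1) / length
    eps = (r1 * w1 - r2 * w2) / (r1 - r2)
    return rs, eps


# Synthetic check: 25 ohms/square with 0.5 um of linewidth loss.
r1 = 25.0 * 400.0 / (5.0 - 0.5)
r2 = 25.0 * 400.0 / (8.0 - 0.5)
print(sheet_resistance_and_loss(r1, r2))
```

Because both unknowns are recovered from the pair of measurements, the sheet resistance obtained this way does not depend on knowing the actual etched linewidth.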


2.3 van der Pauw group (VANDERPAUW) 


The four-terminal symmetric structure shown in Fig. 7 has been 
provided on the A chip to make accurate determinations of polycrys- 
talline silicon (poly-Si) sheet resistance in a way that is independent 
of the actual shape of the resistive pattern.!° The poly-Si lines have 
been extended to the probing pads so that window and metallization 
lithography is not necessarily required. However, the option to have 
overlying metallization on the lines leading to the van der Pauw
pattern, with windows to the underlying poly-Si, is available for 
unusual circumstances in which polysilicon sheet resistance may be 
very high. 

The other VANDERPAUW structures are on the C4 chip, and 
consist of GUARDRING and GASAD patterns. The combinations 
POLY/GUARDRING and POLY/GASAD VANDERPAUW patterns 
are also included on the C4 chip to enable measurements of inversion 
layer sheet resistance’ and to investigate CHANSTOP performance. 


2.4 IGFET group 


It is not anticipated that the IGFET group of devices will be utilized 
as often as the HEXCAP group of MOS capacitors, because the 
IGFETs require a minimum of four mask levels (GASAD, POLY, 
WINDOW, and METAL). For this reason, all IGFETs have been 
relegated to the B chip that has more available terminals than the A 
chip, although it may be less convenient for automatic probing. 

Most of the IGFETs are included in one group with common sources 
and gates. The structure of the IGFET with L = 20 µm is shown in
Fig. 8. All gates are 100 µm wide, and the gate lengths L descend
through the sequence 20, 15, 10, 8, 6, 5, 4, 3, 2, 1.5, and 1.0 µm. It is not
anticipated, however, that the shortest gates will be resolved with 
ordinary photolithographic processing. Therefore, the POLY mask 
level may also be reissued for X-ray lithography. Even with wide 
variations in processing, the range of gate lengths provides a means to 
determine the true (electrically active) channel length from a plot of 
β versus the coded value for L, where β is the transconductance of
the IGFET. 

The GUARDRING option is available on all elements of the IGFET 
group. The guard ring straddles the GASAD feature on the three sides 
that are not adjacent to the gate, as shown in Fig. 9. Such a structure 
may be useful to minimize edge leakage when Schottky barrier sources 
and drains are investigated. 

In some cases it is useful to make C-V (capacitance measured as a 
function of voltage) measurements of the gate electrode in an active 
IGFET. But practical IGFETs are generally designed to minimize gate 
capacitance in order to maximize switching speed, and the true gate 
capacitance is difficult to separate from parasitic capacitance. There- 
fore, four large-gate IGFETs have also been included on the B chip 
with gate dimensions descending through the sequence 500, 300, 200, 
and 100 µm square.


2.5 Gated diode group (GATODE) 


The measurement of the depleted surface recombination velocity 
$s_0$” is especially useful in evaluating the effectiveness of a low-temperature
anneal to reduce surface state density” and to investigate the
effects of radiation damage.’*”’ In the determination of $s_0$ it is necessary
to directly control the surface potential near an MOS capacitor by 
means of a third electrode. This option has been made available by 
means of the gated diode group (GATODE), with dimensions decreasing
through the sequence 500, 300, 200, and 100 µm square. The
structure of the 100-µm gated diode is shown in Fig. 10. Obviously, a
nearly equivalent structure could be realized by simply shorting the 
source to the drain of one of the large area IGFETs. The GATODE 
structures on the B chip differ from the IGFETs, however, in that the 
source-and-drain diffusions completely surround the gate electrode 
except for a 10-µm tab that connects the gate electrode to the probing
pad. The GATODE structures are inverted from the usual gate-con- 
trolled diode in the sense that the p-n junction surrounds the gate 
electrode, whereas, the original gate-controlled diode structure con- 
sisted of an MOS capacitor in the form of a ring that surrounded a 
p-n junction.’*’” The advantage of the GATODE structure is that 
better control of minority carrier production is possible when primary 
interest is centered on the properties of deeply depleted MOS capaci- 
tors.'*"? 


2.6 Contact metallization test chip (METEST) 

Electromigration studies generally require high current densities, of 
the order of 10⁶ A/cm², to achieve accelerated aging at a practical
rate.”” In the vicinity of contact windows, electromigration has been 
difficult or impossible to study, because the only test structure avail- 
able has been the 100-window arrays on the Poon Tester A and C 
chips.’ At the required current density, the sum of the voltage drops 
accumulated over a 100-window array often exceeds the breakdown 
voltage of the p-n junction that exists beneath each pair of windows. 
The contact metallization test chip D (METEST) has been designed 
to avoid large accumulated voltage drops by means of a tapped string 
of metal-to-diffusion windows, as shown in Fig. 11. Each tapped string 
is composed of series combinations of 1, 2, and 4 contact cells. Struc- 
tural detail of one such contact cell is shown in Fig. 12 for a window 
size of 7 µm. With the D chip, a reliability engineer can select 2, 4, 6,
8, 10, or 14 windows in series, depending upon the particular breakdown 
characteristics of the structure. Each tapped string has been repro- 
duced for a variety of contact window sizes, decreasing through the 
sequence 7, 5, 3, 2, 1.5, and 1.0 µm square. It is not expected, however,
that the smallest windows would be routinely resolved by standard 
photolithographic processes. Therefore, the POLYCON mask level, 
which contains the contact windows for the METEST structures, has 
been reissued for X-ray lithography. The X-ray alignment features 
have been modified appropriately to make the X-ray lithographic level
compatible with prior GASAD and subsequent POLY optical litho- 
graphic levels. 
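
The window counts available from a tapped string can be enumerated by summing adjacent segments between any two taps. The sketch below rests on two assumptions of ours that are not stated explicitly in the text: each contact cell contributes two windows in series, and the segments are arranged along the string in the order 2, 1, 4 cells. With that ordering the enumeration reproduces the selections 2, 4, 6, 8, 10, and 14 listed above.

```python
def selectable_window_counts(segment_cells, windows_per_cell=2):
    """Window counts reachable between any two taps of a tapped string.

    segment_cells lists the number of contact cells in each segment between
    consecutive taps, in physical order along the string.
    """
    counts = set()
    for start in range(len(segment_cells)):
        cells = 0
        for seg in segment_cells[start:]:
            cells += seg  # extend the run to the next tap
            counts.add(cells * windows_per_cell)
    return sorted(counts)


print(selectable_window_counts([2, 1, 4]))
```

An ordering of 1, 2, 4 cells would instead yield 12 in place of 10, which is why the middle position of the single-cell segment is assumed here.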

There has been a tendency to avoid rectangular contacts with large 
aspect ratios, i.e., L/W > 3. The reason is related to photolithographic 
exposure problems with very small contact windows, such as W ≤ 2
µm. When additional contact area is required, parallel strings of square
contact windows are often preferred to large, rectangular contacts. For 
this reason, the smallest contact windows on the D chip have been 
repeated in multiples of 4, 6, and 8 parallel windows for the 2-, 1.5-,
and 1.0-µm windows, respectively. Structural detail of a multiple-window
contact cell is shown in Fig. 13 for 4-window, 2-µm contacts.
Obviously, the current does not divide evenly among the windows in 
such a multiple-window structure, but the extra contact windows can 
be regarded as providing an experimental backup when the first 
window fails. The multiple window strings also tend to increase con- 
tinuity probability when working close to the limit of lithographic 
resolution. 


III. ORGANIZATION

The lithographic mask system for fine-line process development has 
been organized on the wafer so that the simplest structures with the 
most convenient dimensions are available together on chip A. The 
included structures are HEXCAP, SADSHEET, POLYSHEET, and 
VANDERPAUW (see Section II). Perhaps the most unusual feature 
of the mask organization has stemmed from the enormous range of 
device sizes that have been made available to maximize experimental 
flexibility. Thus, structural dimensions ranging from 1 µm to 4000 µm
are all present together on the same wafer. Furthermore, the largest 
areas can be used for direct probing, for evaporated MOS dots, or can 
be easily cleaved for Auger, SEM, TEM, X-ray, and other analytical 
investigations. It is the large range of device sizes (4-1/2 orders of 
magnitude) which has dictated chip designation and wafer layout. 


3.1 Test chip designation 
3.1.1 Chip A (1600 × 4096 µm)


Most of the devices with dimensions ranging between 1.0 and 500 
µm are included on the A chip. A composite view of the POLY and
WINDOW levels of the A chip is shown in Fig. 14. Both the dimensions 
of this chip and the placement of the 36 probing pads have been 
selected to coincide with the Poon Tester’ and the Process Monitor* 
chips to facilitate automated probing with existing probe cards. 


3.1.2 Chip B (8000 × 8192 µm)

The dimensions of the B chip are integral multiples (5 × 2) of the A
chip to facilitate wafer layout (see Section 3.2), and the area is equal
to that occupied by ten A chips. A composite view of the POLY, 
WINDOW, and METAL levels of the B chip is shown in Fig. 15. The 
size of the B chip has been selected to accommodate HEXCAP groups 
measuring 1000 and 2000 µm square, in the case of FOXCAP, GOXCAP,
OVLAP, and NOVLAP. The GOXCOMB and POLYCOMB
capacitors are nearly square, and have been laid out so that the 
equivalent areas are equal to the areas of the square capacitors (see 
Table III). Please note that to provide adequate resolution for illustra- 
tion in Fig. 15, the width and spacing of the tines in POLYCOMB have 
been magnified 3X, and the number of tines has been accordingly 
reduced by a factor of 3 so that the overall dimensions remain un- 
changed. Accurate dimensional data on POLYCOMB can be measured 
from Fig. 15 by scaling down detail 3X. Areas and perimeters are listed 
in Table III. Also the gap between the field plate and the OVLAP and 
NOVLAP capacitors has been widened 3X. The IGFET arrays and 
gated diodes were assigned to the B chip for two reasons: (i) at least 
four photolithographic steps are required to realize completed devices, 
so it is anticipated that these will not be used as frequently as the
two-level structures on the A chip; (ii) the IGFET and gated diode arrays
require 23 additional probing pads that are not available on the A chip. 
To provide adequate resolution for illustration, the spacing between 
gates and sources and drains has been increased 3X in Fig. 15. In the 
gated diodes the space between gates and junctions has also been 
increased 3X. 


3.1.3 Chip C (6400 × 8192 µm)


The largest MOS capacitors, with areas measuring (4000 µm)², had
to be allocated to four separate chips. The C1 chip contains FOXCAP
and GOXCAP capacitors with areas equal to (4000 µm)². A composite
view of the POLY and WINDOW levels is shown in Fig. 16. The C2 
chip contains a rectangular GOXCOMB structure with area somewhat 
reduced from the rectangular devices on chip C1; the exact coded areas 
are listed in Table IV. A composite view of the POLY and WINDOW 
levels is shown in Fig. 17. The C3 chip contains a rectangular
POLYCOMB structure with area somewhat reduced from the rectangular
devices on chip C1; the exact coded areas are listed in Table IV. A
composite view of the POLY and WINDOW levels is shown in Fig. 18. 
To provide adequate resolution for illustration in Fig. 18, the width 
and spacing of the tines have been magnified 3×, and the number of
tines has been accordingly reduced by a factor of 3 so that the overall
dimensions remain unchanged. Accurate dimensional data can be
measured from Fig. 18 by scaling down detail 3×. Areas and perimeters
are listed in Table IV.

The C4 chip contains both OVLAP and NOVLAP rectangular
capacitors with areas measuring (4000 µm)² and surrounded by a field
plate. A composite view of GUARDRING, GASAD, POLY, and WINDOW
levels is shown in Fig. 19. For the purpose of illustration the gap
between the field plate and the OVLAP and NOVLAP capacitors has
been widened 3× in Fig. 19. All C chips measure 6400 × 8192 µm.
When combined with the D chip (see next section), the overall
dimensions of the combination are exactly equal to the dimensions of the B
chip or a 2 × 5 array of A chips.
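
The dimensional relationships among the chips can be checked with a
short calculation. The following Python sketch is illustrative only and
is not part of the mask documentation. The A-chip size of 1600 × 4096 µm
is inferred from the statement that the B chip is a (5 × 2) multiple of
the A chip, and the pairing of one C chip with a column of two D chips is
an assumption made here so that the heights agree.

```python
# Chip dimensions in micrometers (width, height).
# A is inferred from B = (5 x 2) multiples of A; it is an assumption here.
B = (8000, 8192)
A = (B[0] // 5, B[1] // 2)
C = (6400, 8192)
D = (1600, 4096)


def area(chip):
    """Return the chip area in square micrometers."""
    width, height = chip
    return width * height


# Assumption: one C chip sits beside a column of two stacked D chips.
combo = (C[0] + D[0], max(C[1], 2 * D[1]))

print("Inferred A chip:", A)
print("C + 2D block:", combo, "equals B:", combo == B)
print("B area in units of A area:", area(B) // area(A))
```

Under these assumptions the C-plus-D block and the 2 × 5 array of A
chips both occupy exactly the B-chip footprint.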


3.1.4 Chip D (1600 × 4096 µm) METEST


A composite view of the POLY and WINDOW levels of the contact 
metallization test chip D (METEST) is shown in Fig. 20. Both the 
dimensions and the locations of the 36 probing pads of the METEST 
chip have been selected to coincide with the Poon Tester’ and the 
Process Monitor‘ chips to facilitate automatic probing. It is expected 
that the tapped strings with 2-, 3-, 5-, and 7-µm windows will be used
most extensively. These strings have been terminated on probing pad
numbers 3, 4, 5, 6 (3 µm), 9, 10, 11, 12 (2 µm), 21, 22, 23, 24 (7 µm), and
27, 28, 29, 30 (5 µm) to coincide with existing metallization probing
cards. 

Detailed dimensional data for the D chip are listed in Table V. 
Table V differs from Tables II through IV in many respects, because 
the tapped strings were not intended to provide capacitance data. 
There are no GUARDRING features. Entries tabulated under GASAD 
tub refer to features straddling the indicated pad numbers, although 
all tubs within each string are connected in parallel after metallization 
with the POLY level. Entries tabulated under POLYCON window 
refer to the total cross-sectional area of a single tub input or output. 
However, current density is not expected to be uniform over any given 
window and especially among multiple windows. Entries tabulated 
under POLY are intended to aid in estimating string resistances from 
the sheet resistance of the metallization layer provided by the POLY 
pattern. Taps and ties are defined in Fig. 11, and the equivalent 
numbers of squares straddling each pair of pad numbers are indicated. 
The equivalent numbers of squares for the contact areas were not
included, because these depend upon the sheet resistance of the
underlying tubs.
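
As an illustration of how the POLY entries might be used, the resistance
of a metallization segment between two pads can be estimated as the sheet
resistance multiplied by the equivalent number of squares. The Python
sketch below uses hypothetical numbers (the sheet resistance and the
square count are placeholders, not values from Table V) and, consistent
with the tabulation, ignores the contact areas.

```python
def string_resistance(sheet_resistance_ohm_per_sq, num_squares):
    """Estimate segment resistance in ohms as R_sheet x squares.

    Contact areas are ignored, so this is a lower-bound style estimate.
    """
    if sheet_resistance_ohm_per_sq < 0 or num_squares < 0:
        raise ValueError("inputs must be non-negative")
    return sheet_resistance_ohm_per_sq * num_squares


# Hypothetical example values, not taken from Table V.
r_sheet = 0.05    # ohms per square
squares = 120.0   # equivalent squares between two pads
estimate = string_resistance(r_sheet, squares)
print(f"Estimated segment resistance: {estimate:.2f} ohms")
```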


3.2 Wafer layout 


The location of each of the test chips described in the preceding 
section is shown in Fig. 21. The A chip is the most numerous, totaling 
130 and arranged in blocks of 10 to form the cross-shaped pattern 
coded AX in Fig. 21. The number X, following A, denotes the width of 
the guard ring when the GUARDRING (N35) option is selected. The
symbol A0 denotes unguarded devices. The arrangement of the A
chips, which have been laid out to permit automatic probing with 
existing facilities, is obviously intended to reveal horizontal and vertical 
parametric trends on test wafers. 

There are only twelve B chips, which contain 1000- and 2000-µm
devices. When the GUARDRING (N35) option is selected, there are
only two chips for each guard-ring width, i.e., 2, 3, 5, 7, and 10 µm. As
in the case of the A chip, guard-ring width is represented by X in the 
notation BX, shown in Fig. 21. The asterisks denote undefined guard- 
ring diffusions or implantations that cover the entire chip for evalua- 
tion of guard-ring performance without a parallel Schottky diode. 

There are six C1 and six C4 chips on the test wafer. In each case,
three are unguarded and three have 10-µm guard rings, viz. C1-0 and
C1-10 in Fig. 21. The C2 and C3 chips contain large GOXCOMB and 
POLYCOMB structures, respectively. Two are unguarded, viz. C2-0, 
and two have guard rings, viz. C2-10, in Fig. 21. In the case of 
GOXCOMB, the guard-ring width does not permit interdigitating the 
individual COMB elements, so the guard-ring option provides a buried 
diffused tub beneath the GASAD structure. As in the B chip, the 
asterisks denote undefined guard-ring diffusions or implantations. 

The contact metallization test chip is denoted D in Fig. 21. There 
are a total of 40, which have been divided equally among the four 
quadrants of the wafer. 
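
The chip counts quoted in this section can be cross-checked against the
chip dimensions. The Python sketch below is a consistency check only. It
assumes that the C2 and C3 chips each occur four times (two unguarded and
two guarded), which is one reading of the text, and it uses an A-chip
size of 1600 × 4096 µm inferred from the B-chip dimensions.

```python
# Chip counts per wafer as read from the text.
counts = {"A": 130, "B": 12, "C1": 6, "C4": 6, "D": 40}
# Assumption: two unguarded plus two guarded chips for EACH of C2 and C3.
counts["C2"] = 4
counts["C3"] = 4

# Chip areas in square micrometers; the A-chip size is inferred, not quoted.
A_AREA = 1600 * 4096
C_AREA = 6400 * 8192
chip_area = {
    "A": A_AREA,
    "B": 8000 * 8192,
    "C1": C_AREA, "C2": C_AREA, "C3": C_AREA, "C4": C_AREA,
    "D": 1600 * 4096,
}

c_total = sum(counts[name] for name in ("C1", "C2", "C3", "C4"))
d_per_c = counts["D"] / c_total
total_a_equiv = sum(counts[n] * chip_area[n] for n in counts) / A_AREA

print("C chips:", c_total, "| D chips per C chip:", d_per_c)
print("Test-chip area in A-chip equivalents:", total_a_equiv)
```

With these assumed counts there are two D chips for every C chip, which
agrees with the C and D chips together filling a B-chip footprint.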

At the top and bottom and left and right are four alignment patterns 
that have been designed to permit aligning any mask to any other in 
any order (see Section 3.4). Also associated with each alignment 
pattern are two TEM test chips that have been specially designed to 
facilitate transmission electron microscopic (TEM) analysis.”! (See 
Section 3.5.) 


3.3 Design rules 


The lithographic mask system has been designed specifically for 
fine-line process development and process control. Since the masks are 
generally to be used with standard photolithographic processing, it 
obviously would be inappropriate to interpret data in terms of fine- 
line design rules. Consequently, most device structures have been 
designed to tolerate up to 5 µm in misalignment errors. Figure 22 is an
example of such relaxed design rules, showing the source-gate structure 
of a typical element from the IGFET group (see Section 2.4). The 
same structure also applies to the junction contacts of the gated diode
group (GATODE) (see Section 2.5). In general, all contact windows
have a minimum width of 5 µm, and all overlapping regions are a
minimum of 5 µm.

A number of exceptions exist, however. Most prevalent is the
GOXCAP group (see Section 2.1.1), which has been deliberately
continued to a minimum size of 1.0 µm to enable special experiments with
nonstandard photolithographic processes or X-ray lithography. A
similar philosophy has been applied to the entire contact metallization
test chip D (METEST), which has been fully described in Section 2.6.


3.4 Alignment features 


It is hoped that maximum flexibility has been achieved by the use 
of a modified version of the standard Perkin-Elmer projection (PEP) 
alignment features. These modified PEP (MOPEP) features are shown 
in Figs. 23a through f and are presented in the anticipated “normal” 
order or suggested sequence, i.e., GUARDRING, GASAD, POLYCON,
POLY, WINDOW, and METAL. The upper set of MOPEP
features in each of Figs. 23a through f corresponds to the “normal” 
processing sequence. Unlike alignment procedures for virtually all 
device codes, each mask in this lithographic system must be aligned to 
the immediately preceding level, because levels prior to the one im- 
mediately preceding introduce overlapping patterns. But the unique 
feature of the MOPEP alignment features is that any number of levels 
can be skipped. For example, it has been anticipated that a popular 
sequence may be GASAD followed by POLY only. The alignment 
feature remaining on the test wafer after GASAD lithography is shown 
in Fig. 23b. Alignment of POLY to GASAD corresponds to a “normal” 
processing sequence, so the second MOPEP feature in the upper half 
of Fig. 23d would have to be aligned to the second (right-hand) 
MOPEP feature in the upper half of Fig. 23b. The left-hand MOPEP
feature in Fig. 23b is simply ignored, because the GUARDRING level 
was omitted. 

The lower set of MOPEP alignment features in Figs. 23a through f 
has been included to enable an “inverted” processing sequence. Such
an “inverted” processing sequence might be required for some unique 
or novel structure that was not originally intended or anticipated. For 
example, POLY features can be defined on the surface of an unpat- 
terned field oxide. After oxidation, or deposition of an intermediate 
dielectric layer, it might be necessary to define additional conductive 
features of either poly-Si or metal directly over the original poly-Si 
features. This capability is available by using a “GASAD” level with 
reverse tone (see Table VI), consisting of opaque features within a 
transparent background, to produce conductive patterns in polysilicon 
or metal. Alignment is carried out by inserting the central MOPEP 
feature in the lower half of Fig. 23b into the right-hand MOPEP 
feature on the lower half of Fig. 23d, which would be the pattern left 
on the test wafer after POLY lithography. (Tone reversal is not applied
to MOPEP features.) 

To be consistent with the relaxed 5-µm design rules discussed in the
preceding section, the right-hand MOPEP features in the upper halves
of Figs. 23a through f are all 25 µm wide. All of the other MOPEP
features in the upper halves of Figs. 23b through e and all MOPEP
features in the upper half of Fig. 23f are 20 µm wide. A similar scheme
that provides 2.5-µm frames for alignment in inverted order applies to
the lower halves of Figs. 23a through f. The 2.5-µm alignment frame
has resulted from a compromise that should offset the effects of
photolithographic processing variations, but proper alignment does
require some judgment on the part of the alignment operator to optimize
registration of sequential mask levels.
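
The 2.5-µm frame follows directly from the two feature widths. The short
Python sketch below makes the arithmetic explicit, assuming that the
20-µm feature is centered inside the 25-µm feature when alignment is
perfect; the function name is illustrative.

```python
def frame_width(outer_um, inner_um):
    """Clearance on each side when an inner feature is centered in an outer one."""
    if inner_um > outer_um:
        raise ValueError("inner feature is wider than outer feature")
    return (outer_um - inner_um) / 2


print(frame_width(25, 20))  # 2.5
```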


3.5 TEM test chip 


Sample preparation techniques for transmission electron microscopy 
(TEM) usually produce sections sufficiently thin over a region that 
may vary between 40 and 100 µm. All morphological features essential
for process evaluation can be translated into the area for TEM study
by a special test pattern 1 mm wide and approximately 6.7 mm long,
and with a structural period of 29.5 µm. The TEM test pattern is
shown in Fig. 24, showing gate oxide within regions defined by GASAD, 
contact windows formed by POLYCON, layers of poly-Si defined by 
POLY, and subsequent P-glass and metallization levels. Thus all 
windows, steps, and other peripheral features normally encountered in 
fine-line process development are reproduced over 200 times within 
each TEM test chip. Two TEM test chips have been placed symmet- 
rically with respect to each MOPEP alignment feature (see Fig. 21), 
and no active device areas have been sacrificed. A total of eight TEM
test chips have been incorporated into the lithographic mask system,
and the feature boundaries of each TEM test chip are oriented
orthogonal to a (100) cleavage plane so that the cross section shown in
Fig. 24 can be readily obtained from widely separated areas of the 
wafer. The TEM test chip shown in Fig. 24 differs from the one 
published by Sheng and Marcus,” partially because the chip in Fig. 24 
was designed for a fine-line process that does not involve selective 
oxidation. 
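
The statement that the features are reproduced over 200 times can be
checked from the quoted pattern length and period. The Python sketch
below also counts how many complete periods fit within the thinned
region, taking the 40- and 100-µm limits quoted above as the two extreme
cases; it treats the approximate 6.7-mm length as exact.

```python
PATTERN_LENGTH_UM = 6.7 * 1000   # "approximately 6.7 mm", treated as exact
PERIOD_UM = 29.5                 # structural period of the test pattern

# Number of complete structural periods along the whole pattern.
repeats = int(PATTERN_LENGTH_UM // PERIOD_UM)

# Complete periods that fit inside the thinnest and widest usable regions.
periods_in_thin_region = {
    width: int(width // PERIOD_UM) for width in (40, 100)
}

print("Repeats per TEM test chip:", repeats)
print("Complete periods in thinned region:", periods_in_thin_region)
```

Even the narrowest thinned region spans at least one complete period, so
every feature type should fall within the area available for study.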


IV. MASK LEVELS 


The six mask levels that comprise the normal tone portion of the 
lithographic mask system are listed in the upper part of Table VI and 
are intended to be used with positive photoresist. The suggested 
sequence reflects the primary purpose for which each level was
intended; a few examples are shown in Table VII. Each mask level
contains four full sets of MOPEP alignment features, corresponding to
“normal” and “inverted” processing sequences. Thus, it is possible to 
align any mask level to any other, thereby permitting novel device 
structures with processing sequences that have not been anticipated. 
(See Section 3.4 for alignment details.) The three levels that have been 
issued in reverse tone are shown in the lower part of Table VI. They 
are intended for use with negative photoresist, uniform gold
metallization, selective oxidation or other special processes. 


4.1 GUARDRING 


The GUARDRING level would generally be omitted for most MOS 
processing investigations. In a sense, it may be viewed as being anal- 
ogous to the isolation tub diffusion that occurs in CMOS processing 
prior to GASAD. The principal purpose of the guard ring is to provide 
p-n junctions that straddle the boundaries of Schottky barrier diodes 
to electrically isolate metallization edges that often obscure barrier 
characterization. The GUARDRING level is unusual because it
comprises six pattern subsets that provide guard-ring widths of 10, 7,
5, 3, 2, and 0 µm. The location of each of these subsets is indicated in
Fig. 21 by the final hyphenated integer X, e.g., C1-X. Each guard ring is
located such that it frames each GASAD boundary symmetrically. In 
the case of the smallest GASAD features, the area enclosed by the 
larger GUARDRING features vanishes, and a p-n junction is formed, 
which is useful for evaluating guard-ring performance. The asterisks 
denote undefined diffused or implanted areas that cover the entire 
chip. These chips can also be used to characterize the guard rings 
independently from the other features or to evaluate ion implantation 
profiles (see Section 2.1.1). 
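
The symmetric framing described above can be sketched numerically. The
Python function below assumes a square GASAD feature and a guard ring of
width W centered on its boundary, so that the ring extends W/2 to either
side of each edge; the function and variable names are illustrative and
do not come from the mask documentation.

```python
def guard_ring(side_um, width_um):
    """Return (ring_area, enclosed_area) in square micrometers.

    Assumes a square GASAD feature of the given side and a guard ring of
    the given width centered on its boundary.
    """
    outer = side_um + width_um
    # The enclosed area vanishes once the ring width reaches the feature size.
    inner = max(side_um - width_um, 0.0)
    return outer**2 - inner**2, inner**2


for side in (1000, 10, 5, 1.5):
    ring, enclosed = guard_ring(side, 5)
    print(f"side {side:>6} um, 5-um ring: ring area {ring:g}, enclosed {enclosed:g}")
```

Under this assumption a 1000-µm feature with a 3-µm ring gives a ring
area of 12,000 µm² enclosing 994,009 µm², which appears consistent with
the corresponding entries in the pad key for chip B; the tables remain
the authoritative source.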


4.2 GASAD 


The GASAD level is normally the first level that would be used for 
MOS process development and monitoring. The principal purpose of 
the GASAD level is to open up areas in the field oxide in preparation 
for a possible ion implantation, for threshold control, followed by gate 
oxidation. If gate oxidation is omitted, however, the GASAD level 
provides a range of areas for investigations of contact resistance, 
Schottky barrier diodes, and p-n junctions. Most of the patterns 
provided by GASAD are square and progressively increase in size 
through the sequence 1.0, 1.5, 2, 3, 4, 5, 7, 10, 15, 20, 50, 100, 200, 500, 
1000, 2000, and 4000 µm. At 200 µm and above, comb-like structures
(GOXCOMB) are included in the GASAD level for investigation of
peripheral effects and defects. The 200- and 500-µm GOXCOMB
structures consist of a series of 25-µm slots separated by 10 µm of field
oxide (see Fig. 3). All of the slots are interconnected to guarantee
equalization of surface potential. The choice of the relatively large
slots assures that all of the MOS capacitors within HEXCAP will have
nearly equal gate areas despite typical variations in linewidth owing to
processing variables. Obviously, GOXCOMB can be utilized for
aggravating edge effects or defects in device structures. In the largest
GOXCOMB structures (1000, 2000, and 4000 µm) the slots have been
reduced to 5 µm, separated by 10 µm of field oxide, to further increase
the aggravation caused by peripheral electric fields and edge-related
defects.
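
The degree to which a comb structure emphasizes the periphery can be
estimated by comparing perimeter per unit gate area for a square
capacitor and for long, narrow slots of equal total area. The Python
sketch below is a rough approximation that neglects the slot ends and the
interconnections between slots; it is not the method used to generate the
tabulated perimeters.

```python
def square_perimeter_per_area(side_um):
    """Perimeter/area (1/um) of a square capacitor of the given side."""
    return 4.0 / side_um


def slot_perimeter_per_area(slot_width_um):
    """Approximate perimeter/area (1/um) of long narrow slots; ends ignored."""
    return 2.0 / slot_width_um


# (nominal group size, slot width) pairs quoted in the text, in micrometers.
for side, slot in ((200, 25), (500, 25), (1000, 5), (2000, 5), (4000, 5)):
    gain = slot_perimeter_per_area(slot) / square_perimeter_per_area(side)
    print(f"{side:>4}-um group, {slot}-um slots: edge per unit area x{gain:g}")
```

On this estimate, narrowing the slots from 25 to 5 µm in the largest
structures raises the edge length per unit area by roughly two orders of
magnitude relative to the equivalent square capacitor.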

It is anticipated that in most MOS process monitoring and devel- 
opment applications it would be possible to use wafers that had been 
previously patterned using the GASAD mask. Thus, the most impor- 
tant device structures would be complete after only one additional 
photolithographic step (see POLY, Section 4.4). 


4.3 POLYCON 


For most MOS monitoring or process development applications the 
POLYCON level would be omitted. The main reason for including this 
level has been to provide windows to the underlying GASAD tub 
diffusions or ion implantations that are required by the metallization 
test chip D (METEST). However, when investigation of p-n junctions 
with large areas or extended peripheries is required, the GASAD mask 
is used to define the diffused or ion-implanted regions. In this case the 
POLYCON level provides the required contact windows to junctions 
formed by the GASAD features. Evaluation of p-n junctions thus 
requires a minimum of three mask levels, because the POLY level 
must be used for metallization. If it should be necessary to investigate 
the effects of high-temperature processing after poly-Si deposition, 
such as oxidations and/or insulating depositions, additional mask 
levels WINDOW and METAL might be required. 


4.4 POLY 


It has been anticipated that the combination of GASAD followed by 
POLY would be the most widely used sequence of mask levels for 
MOS process development. For this reason, the POLY patterns have 
been extended to include the probing pads so that the poly-Si can be 
probed directly. Typical probe-spreading resistance measurements are 
of the order of 50 Ω in a poly-Si film with a sheet resistance of 20 Ω/□,
provided that the probe tips are sufficiently hard and sharp
to pierce the 50 Å of native oxide that typically occurs on the
surface of n⁺-poly-Si. For this purpose tungsten carbide probes are
recommended with tip radii of 5 X 10~* cm or less. The use of palladium 
probes with planar tips 5 x 10°° cm in diameter has been found to be 
unsatisfactory for probing n⁺-poly-Si. Unfortunately, many probe
cards designed for automatic probing cannot be used for probing
n⁺-poly-Si directly, because erratic probe contact resistance may range
over many orders of magnitude, sometimes exceeding 10 megohms.
Regular cleaning and inspection of probe tips are mandatory when
probing n⁺-poly-Si. Ion implantation causes an amorphous layer of
semi-insulating material to occur near the surface of n⁺-poly-Si. Thus,
any n⁺-poly-Si that has been exposed to ion implantation must be
annealed at 950°C in N₂ for 30 minutes to avoid excessive probe
resistance. In automatic probe stations with existing probe cards, the 
POLY mask level may be used to pattern aluminum, with or without 
an intervening poly-Si layer, to reduce probe resistance. Alternatively, 
the WINDOW and METAL mask levels can be used (see next sec- 
tions). 

At 200 µm and above, comb-like structures (POLYCOMB) are
included in the POLY level for investigating edge effects or defects 
(see Fig. 4). The width and separation of individual elements are the 
same as for GOXCOMB (see GASAD, Section 4.2). 


4.5 WINDOW 


Occasionally, it is necessary to investigate changes in device char- 
acteristics resulting from oxidations, insulating depositions, or anneal- 
ing after poly-Si definition. For this purpose the WINDOW level has 
been provided, which opens 90-µm square windows over each probing
pad. In most cases the poly-Si could then be probed directly at the
contact pads without resorting to aluminum deposition and lithography,
provided the recommendations contained above in Section 4.4
are followed. In the case of the OVLAP and NOVLAP MOS capacitors 
surrounded by field plates, the capacitors must be probed directly. For 
this purpose, the WINDOW level also contains large contacting areas 
over each capacitor surrounded by a field plate. Occasionally, it may 
be beneficial to access the van der Pauw pattern with aluminum 
metallization. Therefore, the WINDOW level also contains four con- 
tacts oriented directly over the leads at the edges of the van der Pauw 
structures to provide conduction to overlying metal lines leading to 
the probing pads. The WINDOW level is also required for contacts to 
the sources and drains of IGFETs, the junction terminal of the
GATODEs, and to SADSHEET.


4.6 METAL 


As in the case of the GUARDRING and POLYCON levels, it is 
expected that the METAL level could be omitted in most MOS 
monitoring or process development applications. The principal
exceptions include IGFETs and gated diodes, in which structures the
required electrical continuity could not have been provided by the POLY
level alone.

In most cases the metal has been excluded from any areas where 
poly-Si is in contact with gate oxide. The reason for this exclusion 
stems from experimental evidence that aluminum, sintered into poly- 
crystalline silicon, sometimes deteriorates the dielectric breakdown 
strength of underlying gate oxides, especially if the gate oxide is very 
thin (i.e., < 250 Å). An exception occurs for all of the MOS capacitors
surrounded by field plates, which must be probed directly. These 
OVLAP and NOVLAP capacitors thus provide experimental structures 
appropriate for cases in which it is necessary to compare the break- 
down voltages in MOS capacitors with and without overlying metalli- 
zation. 


V. APPLICATIONS 


Most basic circuit elements utilized by unipolar semiconductor in- 
tegrated electronics can be easily fabricated with an appropriate subset 
of the lithographic mask system. These elements fall into five general 
classifications: MOS capacitors, p-n junctions, contacts, sheet resistors, 
and IGFETs. The contact class can be further subdivided to include 
guarded and unguarded Schottky diodes, ohmic contacts, and contact 
metallization test cells. The sheet resistor class can be subdivided to 
include polysilicon sheet resistors (POLYSHEET), source and drain 
sheet resistors (SADSHEET), and van der Pauw patterns in the 
GUARDRING, GASAD, and POLY levels. The IGFET class also 
includes gated diodes. These thirteen subdivisions of unipolar device 
structures are shown in the first column of Table VII. The remaining 
six columns show the required mask levels needed to realize a partic- 
ular device structure. Other mask levels are generally optional, but 
some may be required for certain experiments. 

At the present time the lithographic mask system is in wide use, and 
more than fifty experiments have been initiated in the Advanced 
Large-Scale Integration (LSI) Development Laboratory using one or 
more levels. References 22 through 26 contain published experiments 
that have utilized this mask system. 


VI. SUMMARY


Fine-line MOS process characterization, determination of base-line 
parameters, and new process development can be efficiently carried 
out using the lithographic mask system. By selecting an appropriate
subset of photo- or X-ray lithographic mask levels, most unipolar 
semiconductor circuit elements can be fabricated in an enormous range 
of sizes. X-ray and trilevel lithographic processes are used only when 
absolutely required, and these can be interleaved with
photolithographically defined patterns. Any mask level can be aligned to any
other, and any number of mask levels can be skipped. Registration
tolerance is 5 µm for most device structures. Most experimental device
structures can be completed and ready for electrical evaluation in a 
fraction of the time required to fabricate the elements on the Poon 
Tester chips,’ which are included within the array of fine-line device 
chips and require six X-ray lithographic levels to complete. 


Table I—Device key*


Nominal Chip 
MOS Device Dimension Assignment Pad No(s). 
HEXCAP Group 
1. FOXCAP 4000 C1 93 
2000 B 41
1000 B 3 
500 A 2 
200 A 32 
100 A 26 
50 A 19 
2. GOXCAP 4000 C1 42
2000 B 49 
1000 B 7 
500 A 4 
200 A 31 
100 A 29 
50 A 20 
20 A 25 
15 A 1 
10 A 3 
7 A 5 
5 A 7 
4 A 8 
3 A 10 
2 A 11 
1.5 A 12 
1.0 A 14 
3. GOXCOMB 4000 C2 67 
2000 B 106 
1000 B 34 
500 A 6 
200 A 30 
4. POLYCOMB 4000 C3 68, 69, 70, 71 
2000 B 95, 96 
1000 B 22 
500 A 9 
200 A 29 
5/6. OVLAP and NOVLAP with 4000 C4 14 
FIELD PLATE 2000 B 75 
1000 B 13 
500 A 13 
200 A 33 
100 A 28 
50 A 18 
SHEET Group
1. SADSHEET 80SQ A 15
COM A 16
50SQ A 17
2. POLYSHEET 80SQ A 36
COM A 35
50SQ A 34
VANDERPAUW Group 
1. POLYSI 250 A 21 
22 
23 
24 
2. GASAD 250 C4 101 
102 
103 
104 
3. GUARDRING 250 C4 oe 
98 
99 


* All dimensions are in micrometers. Exact areas and perimeters are listed in Tables
II, III, and IV.

Table I—Device key* (Continued)


Nominal Chip 
MOS Device Dimension Assignment Pad No(s). 
4. POLYSI and GUARDRING 250 C4 ee 
89 
90 
91 
92 
93 
94 
5. POLYSI and GASAD 250 C4 is 
77 
78 
79 
80 
81 
82 
IGFET Group Channel 
Terminal Width Length 
1. Drain 100 20 B 87 
100 15 B 86 
100 10 B 85 
100 8 B 84 
100 6 B 83 
100 5 B 82 
100 4 B 81 
100 3 B 80 
100 2 B 79 
100 1.5 B 78 
100 1.0 B 77 
Common Gate 100 B 88 
Common Source 100 B 76 
2. Source 500 500 B 53 
Gate B 54 
Drain B 55 
3. Source 300 300 B 56 
Gate B 57 
Drain B 58 
4. Source 200 200 B 59 
Gate B 60 
Drain B 61 
5. Source 100 100 B 62 
Gate B 63 
Drain B 64 
GATED DIODE Group Channel 
Terminal Width and Length 
1. Junction 500 B 65 
Gate B 67 
2. Junction 300 B 68 
Gate B 69 
3. Junction 200 B 70 
Gate B 71 
4. Junction 100 B 72
Gate B 73 
METEST Group Window Size 
1. Uni-Window 2 D 9-12 
3 D 3-6 
5 D 27-30 
7 D 21-24 
2. Quad-Window 2 D 7, 8, 25, 26 
3. Hex-Window 15 D 1, 2, 35, 36
4. Oct-Window 1 D 31-34 


* All dimensions are in micrometers. Exact areas and perimeters are listed in Tables
II, III, and IV.

Table II—Pad key for chip A*


GUARDRING (Width: 2) 


11,525 
FOXCAP 271,350 

GOXCAP 11,200 

GOXCAP 21,350 248,004 
GOXCAP 11,005 ; 25 
GOXCOMB 361,705 111,705 229,954 
GOXCAP 10,900 10,875 

GOXCAP 10,826 10,810 

POLYCOMB 271,350 21,350 

GOXCAP 10,754 10,745 

GOXCAP 10,684 10,680 

GOXCAP : 10,649.75 10,647.5 

FIELD PLATE 111,500 111,500 

OVLAP 260,100 10,100 

NOVLAP 250,000 0 

GOXCAP : 10,616 10,615 

SADSHEET 10,000 10,000 

SADSHEET 10,000 10,000 12,450 
SADSHEET 10,000 10,000 11,225 
FIELD PLATE 32,750 32,750 10,000 
OVLAP 3,600 1,100 3,600 
NOVLAP 2,500 0 2,500 
FOXCAP 14,850 14,850 10,000 
GOXCAP 14,850 12,350 10,000 
VANDERPAUW 137,675 137,675 30,912.5 
VANDERPAUW 137,675 137,675 8,587.5 | 15,587.5 
VANDERPAUW 137,675: 137,675 8,437.5 | 11,2125 
VANDERPAUW 137,675 137,675 8,602.5 | 17,462.5 
GOXCAP 12,250 11,850 8,100 | 10,000 
FOXCAP 23,350 23,350 8,100 10,000 
GOXCAP 23,350 13,350 8,100 | 10,000 
FIELD PLATE 37,750 37,750 8,100 | 10,000 
OVLAP 12,100 2,100 8,100 12,100 
NOVLAP 10,000 0 8,100 | 10,000 
POLYCOMB $5,350 15,350 8,100 | 10,000 
GOXCOMB 67,600 27,600 8,100 | 10,000 
GOXCAP $5,350 15,350 8,100 | 10,000 
FOXCAP $5,350 $5,350 8,100 | 10,000 
FIELD PLATE $6,787.5 $6,787.5 8,100 | 10,000 
OVLAP , 44,100 4,100 36,100 | 44,100 
NOVLAP 40,000 0 36,100 | 40,000 
POLYSHEET 8,100 | 10,000 
POLYSHEET 35,216 35,216 8,100 10,000 
POLYSHEET 8,100 | 10,000 







* All dimensions are in micrometers. 
Table II—Pad key for chip A* (Continued)


GUARDRING (Width: 3)


GUARDRING (Width: 5)


GUARDRING (Width: 7) 





GOXCAP 180 
FOXCAP 263,169 
GOXCAP 120 
GOXCAP 6,000 247,009 
GOXCAP 84 16 
GOXCOMB 60,150 219,934 
GOXCAP 60 
GOXCAP 48 
POLYCOMB 7,104 
GOXCAP 

GOXCAP 

GOXCAP 

FIELD PLATE 

OVLAP 

NOVLAP 

GOXCAP 

SADSHEET 

SADSHEET 

SADSHEET 

FIELD PLATE 

OVLAP 

NOVLAP 

FOXCAP 

GOXCAP 

VANDERPAUW 

VANDERPAUW 

VANDERPAUW 

VANDERPAUW 

GOXCAP 

FOXCAP 

GOXCAP 

FIELD PLATE 

OVLAP 

NOVLAP 

POLYCOMB 

GOXCOMB 

GOXCAP 

FOXCAP 

FIELD PLATE 

OVLAP 

NOVLAP 

POLYSHEET 

POLYSHEET 

POLYSHEET 


* All dimensions are in micrometers. 


265,225 
200 25 
10,000 245,025 
140 4 
100,250 199,900 

100 

81 

11,840 

64 

49 


0 
245,025 
255,025 


420 
267,289 
280 
14,000 243,049 
196 t) 
140,350 179,874 
144 tt) 
121 0 
16,576 342,216 
100 
81 
72.25 
tt) 
14,000 
14,280 
64 
6,130 





6,888 
0 
1,400 
1,680 
4,489 
1,400 
66,049 
66,049 
66,049 
66,049 
560 
13,689 
2,800 
0 
2,800 
3,080 
6,650 
22,750 
5,600 
47,089 
0 
5,600 
5,880 
6,030 


4,800 


270,400 
400 
20,000 
289 
200,560 
225 

196 
23,680 
169 

144 


132.25 


0 
20,000 
20,400 
121 
7,850 


9,020 
) 
2,000 
2,400 
4,900 
2,000 
67,600 
67,600 
67,600 
67,600 
800 
14,400 
4,000 
0 
4,000 
4,400 


8,000 
0 


8,000 
8,400 


6,000 


GUARDRING (Width: 10) 


240,100 
i) 
149,850 
0 
0 
338,715 
MOS Device





FOXCAP 
GOXCAP 

FIELD PLATE 
OVLAP 
NOVLAP 
POLYCOMB 
GOXCOMB 
FOXCAP 
GOXCAP 
IGFET: SOURCE 
IGFET: GATE 
IGFET: DRAIN 
IGFET: SOURCE 
IGFET: GATE 
IGFET: DRAIN 
IGFET: SOURCE 
IGFET: GATE 
IGFET: DRAIN 
IGFET: SOURCE 
IGFET: GATE 
IGFET: DRAIN 
GATOD: JUNC. 
GATOD: GATE 
GATOD: JUNC. 
GATOD: GATE 
GATOD: JUNC. 
GATOD: GATE 
GATOD: JUNC. 
GATOD: GATE 
FIELD PLATE 
OVLAP 
NOVLAP 


IGFET: COMSOURCE 


IGFET: DRAIN 
IGFET: DRAIN 
IGFET: DRAIN 
IGFET: DRAIN 
IGFET: DRAIN 
IGFET: DRAIN 
IGFET: DRAIN 
IGFET: DRAIN 
IGFET: DRAIN 
IGFET: DRAIN 
IGFET: DRAIN 


IGFET: COMGATE 


POLYCOMB 
GOXCOMB 


Table III—Pad key for chip B*





aes POLY 





0 
1,000,000 
0 
1,000,000 
1,020,100 
2,027,970 
675,000 

0 
4,000,000 
0 

275,000 


U0) 


15,000 
278,200 


107,200 


51,700 


16,200 


0 
4,000,000 
4,040,100 

0 

5,100 
5,150 
5,200 
5,300 
5,400 
5,500 
5,600 
$,800 
6,000 
6,500 
7,000 

0 
8,028,720 
2,676,190 


ais bec Sauce 


Window 
Area 





0 1,031,350 4,490 | 1,031,350 
4,000 1,031,350 4,490 31,350 re 
ty) 209,000 15,1701 209,000 0 
4,000 , 1,020,100 4,040 20,100 | 1,000,000 
4,040 , 1,000,000 4,000 0 | 1,000,000 
5,766 706,350 | 270,927 31,325 | 675,025 
270,020 2,038,220 6,256 | 1,363,220 | 675,000 
4,051,350 8,490 | 4,051,350 0 
8,000 4,051,350 8,490 51,350 | 4,000,000 


265,500 2,500 15,500} 250,000 
1,700 13,500 90,000 
1,300 12,500 40,000 

900 11,500 10,000 


263,475 2,517 13,375 | 250,100 
102,350 1,672 12,250 90,100 
52,190 1,618 12,090 40,100 


21,600 1,100 11,500 10,100 
436,500 30,220 | 436,500 0 
4,040,100 8,040 40,100 } 4,000,000 
4,000,000 8,000 0 } 4,000,000 
10,000 400 10,000 0 
10,000 400 10,000 0 
10,000 400 10,000 0 
10,000 400 10,000 0 
10,060 400 10,000 0 
10,000 400 10,000 0 
10,000 400 10,000 (4) 
C4) 

0 

0 

0 

0 


10,000 400 10,000 

10,000 400 10,000 

10,000 400 10,000 

10,000 400 10,000 

10,000 400 10,000 

31,650 $5,595 24,100 7,550 
2,717,840 | 1,072,598 51,350 | 2,666,490 
8,038,970 11,824 | 5,362,780 | 2,676,190 


340 

0 

11,334 
1,070,486 





SSane0ceco oOo OO OCC 





* All dimensions are in micrometers. 


8,100 
8,100 
8,100 
980,100 
980,100 
8,100 
8,100 
8,100 
8,100 
10,500 
8,100 
10,500 
9,500 
8,100 
9,500 
9,000 
8,100 
9,000 
8,500 
8,100 
8,500 
10,600 
8,100 
9,600 
8,100 
9,100 
8,100 
8,100 
8,100 
8,100 
3,960,100 
3,960,100 
12,500 
8,500 
8,500 
8,500 
8,500 
8,500 
8,500 
8,500 
8,500 
8,500 
8,500 
8,500 
8,100 
8,100 
8,100 


Metal 
Area 


10,000 
10,000 
10,000 
1,020,100 
1,000,000 
10,000 
10,000 
10,000 
10,000 
18,250 
10,000 
18,250 
18,647.5 
10,000 
18,647.5 
16,487.5 
10,000 
16,487.5 
14,987.5 
10,000 
14,987.5 
18,475 
10,000 
15,475 
10,000 
13,975 
10,000 
8,600 
10,000 
10,000 
4,040,100 
4,000,000 
61,100 
16,935 
16,875 
16,870 


GUARDRING (Width: 2) 


Area Peri- 
meter 





1,024,144 
8,000 . ses 
0 0 
8,000 996,004 
8,080 x 1,016,064 
11,532 ; 2,022,208 
$40,020 , 404,994 
4,048,144 / 0 
16,000 . 3,992,004 
1,080 
1,080 
680 
680 
480 
480 
280 
280 
4,220 e 276,094 
2,620 . 105,894 
1,820 x 50,794 
1,020 15,694 
0 0 
16,000 


16,080 
280 





16,820 
16,770 
16,720 
16,670 
16,565 
16,465 
16,225 
15,980 
10,000 
10,000 
10,000 


280 iY) 
22,668 22,668 | 8,017,390 
2,140,972 | 2,140,972 | 1,605,708 
Table III—Pad key for chip B* (Continued)


GUARDRING (Width: 3) 
MOS Device Peri- Enclosed 
meter Area 
3 


FOXCAP 1,000 | 1,026,169 4,052 0 | 1,030,225 1,034,289 4,080 

GOXCAP 1,000 12,000 8,000 994,009 20,000 990,025 28,000 986,049 40,000 | 8,000 

FIELD PLATE 1,000 0 0 0 0 0 0 0 0 0

OVLAP 1,000 12,000 8,000 994,009 20,000 990,025 28,000 986,049 40,000 | 8,000 

NOVLAP 1,000 12,120 8,080 | 1,014,049 20,200 1,010,025 28,280 1,006,009 40,400 | 8,080 
POLYCOMB 1,000 17,298 11,532 | 2,019,330 28,830 2,013,580 40,362 2,007,838 57,660 | 11,532 
GOXCOMB 1,000 810,030 540,020 269,994 | 2,013,580 0 | 2,019,330 0 | 2,027,970 | 5,766

FOXCAP 2,000 | 4,052,169 8,052 0 | 4,060,225 0 | 4,068,289 0 | 4,080,400 | 8,080 

GOXCAP 2,000 24,000 16,000 | 3,988,009 40,000 3,980,025 56,000 3,972,049 80,000 | 16,000 

IGFET: SOURCE 1,620 1,086 2,700 3,780 0 5,400 1,100 

IGFET: GATE 500 

IGFET: DRAIN 1,620 1,086 2,700 3,780 0 5,400 1,100 

IGFET: SOURCE 1,020 686 1,700 2,380 0 3,400 700 

IGFET: GATE 

IGFET: DRAIN 1,020 686 1,700 2,380 3,400 700 

IGFET: SOURCE 720 486 1,200 1,680 2,400 500 

IGFET: GATE 

IGFET: DRAIN 720 486 1,200 1,680 2,400 500 

IGFET: SOURCE 420 286 700 980 1,400 300 

IGFET: GATE 

IGFET: DRAIN 420 286 700 980 1,400 300 

GATOD: JUNC. 6,330 4,220 275,044 10,550 272,950 14,770 270,864 21,100 267,750 
GATOD: GATE 

GATOD: JUNC. 3,930 2,620 105,244 6,550 103,950 9,170 102,664 13,100 100,750 
GATOD: GATE 

GATOD: JUNC. 2,730 1,820 50,344 4,350 49,450 6,370 48,564 9,100 47,250 
GATOD: GATE 

GATOD: JUNC. 1,530 1,020 15,444 2,550 14,950 3,570 14,464 5,100 13,750 
GATOD: GATE 

FIELD PLATE 0 0 0 0 0 0 0 0 0
OVLAP 24,000 16,000 | 3,988,009 40,000 3,980,025 56,000 3,972,049 3,960,100 
NOVLAP 24,120 16,080 | 4,028,049 4,020,025 56,280 4,012,009 4,000,000 
IGFET: COMSOURCE 420 286 0 700 0 980 0 1,400 0 
IGFET: DRAIN 

IGFET: DRAIN 

IGFET: DRAIN 

IGFET: DRAIN 

IGFET: DRAIN 

IGFET: DRAIN 

IGFET: DRAIN 

IGFET: DRAIN 

IGFET: DRAIN 

IGFET: DRAIN 

IGFET: DRAIN 

COMGATE 420 286 0 980 294 0 1,400 0
POLYCOMB 34,002 22,668 | 8,011,728 79,338 | 22,668 | 7,989,100 113,340 | 22,668 | 7,972,150
GOXCOMB 3,211,458 | 2,140,972 | 1,070,470 8,011,728 | 11,322 0 | 8,028,720 0 


* All dimensions are in micrometers. 
Table IV—Pad key for chips C1 through C4*


43,211,250 
FIELD PLATE 0 
OVLAP 16,000,000 
NOVLAP 16,082,100 


DOUBLE 
VANDERPAUW 


GASAD 
POLY 
GASAD 


GUARDRING 
POLY 
GUARDRING 
POLY 
GUARDRING 
POLY 
GUARDRING 
POLY 


eococoooocosd 


VANDERPAUW 


GUARDRING 
GUARDRING 
GUARDRING 
GUARDRING 


GASAD 67,075 
GASAD 67,075 
GASAD 67,075 
GASAD 67,075 


5,760,010 
26,530 

0 

16,400 
16,440 


* All dimensions are in micrometers. 


15,918,100 
15,918,100 


16,093,350 
16,093,350 
43,221,500 
14,493,350 

715,500 
16,082,100 
16,000,000 


53,252.5 
169,975 
29,597.5 
169,975 
15,077.5 
169,975 
16,952.5 
169,975 


55,600 
172,875 
29,425 
172,875 
14,300 
172,875 
16,800 
172,875 


31,350 
16,025 
11,650 
17,900 


31,387.5 
16,062.5 
11,687.5 
17,937.5 


16,890 93,350 
16,890 | 16,093,350 
27,020 | 28,821,500 
5,762,167 93,350 
56,600 715,500 
16,440 82,100 
16,400 0 


101,875 
101,875 
101,875 
101,875 


55,600 
172,875 
29,425 
172,875 
14,300 
172,875 
16,800 
172,875 


31,350 
16,025 
11,650 
17,900 


30,087.5 
14,875.0 
10,775.0 
16,762.5 


eocooocooceso 


8,100 10,000 
8,100 10,000 
8,100 10,000 
32,400 40,000 
8,100 10,000 
15,918,100 | 16,082,100 
15,918,100 | 16,000,000 


43,382.5 
34,675.0 
19,715.0 
13,125.0 
5,202.5 
3,062.5 
7,072.5 
15,325.0 


55,725 
43,875 
29,550 
23,550 
14,425 
11,675 
16,925 
26,675 


31,325 
16,025 
11,650 
17,900 


31,387.5 
16,062.5 
11,687.5 
17,937.5 


GUARDRING (Width: 10)
Peri- | Enclosed
meter Area

164,000 | 32,800 | 15,918,100
16,164,400 | 16,480 0
43,211,250 | 26,530 0

265,300 | 53,060 | 43,078,700 

0 0 0 


164,000 | 32,800 | 15,918,100 
164,400 | 32,880 | 16,000,000 


oooooooco 
ooooooocso 


0 
0 
0 
0 
0 
0 
0 
0 


67,812.5 
0 
67,812.5 
0 
67,812.5 
0 
67,812.5 
0 


oooococeoo 
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Table V—Pad key for METEST chip D* 


GASAD _ POLYCON 
WINDOW 




4 
2 
1 
4 
4 
7 
2 
2 
1 
4 
2 
1 
4 
2 
1 
4 
2 
1 
4 
1 
2 
1 
4 
2 
1 
4 
2 
1 




* All dimensions are in micrometers. 
+ Cross-sectional area of tub input or output. 


Table VI—Mask levels


Suggested
Sequence    Mask Level      Tone       Note    Features    Background
1           GUARDRING       Normal     1       Clear       Opaque
2           GASAD           Normal     1       Clear       Opaque
3           POLYCON         Normal     1       Clear       Opaque
4           POLY            Normal     1       Opaque      Clear
5           WINDOW          Normal     1       Clear       Opaque
6           METAL           Normal     1       Opaque      Clear
2           R. T. GASAD     Reverse    2       Opaque      Clear
4           R. T. POLY      Reverse    2       Clear       Opaque
6           R. T. METAL     Reverse    2       Clear       Opaque


1. For use with positive photoresist.
2. For use with negative photoresist, uniform gold metallization, selective oxidation,
or other special processes.
Table VII—Experimental devices


METAL 


Required Mask Levels 
Device Structures GUARDRING GASAD POLYCON POLY 

"-MOSCapacitors 
"Schottky Diodes: Guarded 

Unguarded i * 

Ohmic Contacts * * * 

p-n Junctions i * * 
"IGFETs 0 
" GatedDiodes GATODE 22222227222 0% 
~SADSHEET 000 

POLYSHEET * 

VANDERPAUW: POLY * 

GASAD * * * 

GUARDRING * * * 

Contact Metallization 
Test Cells METEST * * * 









Fig. 1—FOXCAP and GOXCAP MOS capacitors (chips A through C).
Fig. 2—Guarded Schottky diode and buried channel capacitor formed with FOXCAP
and GOXCAP features (chips A through C).
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Fig. 6—POLYSHEET and SADSHEET sheet resistance and linewidth features (chip 
A). 
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Fig. 8—Standard IGFET with common sources and drains (chip B). 
Fig. 9—Guarded IGFET with common sources and drains (chip B).
Fig. 10—Gated diode with n+ diffusion or implantation completely surrounding the
gate (GATODE chip B).
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Fig. 11—Tapped string for the metallization test chip (METEST chip D). 
Fig. 12—Dual contact cell for the metallization test chip (METEST chip D).
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Fig. 14—Composite POLY and WINDOW levels for the A chip. 
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Fig. 15—Composite POLY, WINDOW, and METAL levels for the B chip. Some of 
the detail has been enlarged 3X to achieve adequate resolution for this illustration. 
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Fig. 16—Composite POLY and WINDOW levels for the C1 chip. 
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Fig. 18—Composite POLY and WINDOW levels for the C3 chip. Detail has been enlarged 3X to achieve 
adequate resolution for this illustration. 
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Fig. 19—Composite GUARDRING, GASAD, POLY, and WINDOW levels for the C4 chip. The frames surrounding OVLAP 
and NOVLAP have been enlarged 3X to achieve adequate resolution for this illustration. 
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Fig. 21—Chip layout on the fine 
Fig. 23—MOPEP alignment features. (a) GUARDRING level. (b) GASAD level.
(c) POLYCON level. (d) POLY level. (e) WINDOW level. (f) METAL level.
Fig. 24—TEM test chip.


CHANSTOP: channel stopping implantation or diffusion to avoid
inversion of the silicon surface at the Si-SiO2 interface

C-V: capacitance measured as a function of voltage

FOXCAP: field oxide capacitor

GASAD: gate and source and drain feature delineated in the
field oxide prior to gate oxidation. Also, the second
photolithographic level in the set of fine-line process
development masks.

GATODE: gated diode, essentially an IGFET (see below) with
common source and drain.

GOXCAP: gate oxide capacitor

GOXCOMB: a gate oxide feature with a comb-shaped structure

GUARDRING: electrically guarded structure, fabricated by ion
implantation or diffusion, which straddles and surrounds
the boundary of a metallization feature, forming a
closed ring.

six-fold or hexadic capacitor group

IGFET: insulated gate field-effect transistor

LSI: large-scale integration

METAL: metallization pattern, the final photolithographic
level in the set of fine-line process development masks.

METEST: metallization test structure consisting of tapped
strings with contacts to underlying diffused tubs.

MOPEP: modified Perkin Elmer projection alignment features

MOS: metal-oxide-semiconductor sandwich structure used
for electrical characterization of device fabrication
processes

NOVLAP: conductive pad not overlapping field oxide and forming
the top level of a metal-oxide-semiconductor capacitor

OVLAP: conductive pad overlapping field oxide and forming
the top level of a metal-oxide-semiconductor capacitor

PEP: Perkin Elmer projection alignment features

POLY: polycrystalline silicon which, when patterned, forms
a conductive electrode for electrical tests. Also, an
intermediate photolithographic mask level in the set
of fine-line process development masks.

POLYCOMB: a polycrystalline silicon feature with a comb-shaped
structure

POLYCON: polycrystalline contact to underlying silicon substrate.
Also, an intermediate photolithographic mask level in
the set of fine-line process development masks.

POLYSHEET: polycrystalline silicon feature for sheet resistance and
linewidth loss measurements

SADSHEET: structure formed during gate and source and drain
(GASAD) lithography to determine source and drain
sheet resistance and linewidth loss

SEM: scanning electron microscope

TEM: transmission electron microscope

VANDERPAUW: a symmetric structure introduced by L. J. van der
Pauw to determine the electrical resistivity of thin
conductive layers

WINDOW: next to the last photolithographic mask level in the
set of fine-line process development masks to form
source and drain contacts to insulated gate field-effect
transistors, junction contacts in gated diodes, and
access to polycrystalline silicon features when the
poly Si is covered by an intermediate dielectric.
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A Circuit That Changes the Word Rate of Pulse 
Code Modulated Signals 


By J. C. CANDY and O. J. BENJAMIN 


(Manuscript received November 2, 1982) 


In this paper we describe a circuit that accepts pulse code modu- 
lated signals sampled at about 8 kHz and resamples them at any 
desired rate up to 512 kHz. When the sampling satisfies Nyquist’s 
criterion, the distortion introduced is at least 35 dB below the signal 
level. The circuit uses a digital low-pass filter to interpolate sample 
values, and it may be integrated as about 2500 gates on a 5 mm²
chip. 


I. INTRODUCTION


It often is impractical to synchronize all of the clocks of an extensive 
digital network. Consequently, data will arrive at connections out of 
synchronism and special circuits are needed to bring them into time 
with the local clock. For irregular bursts of data, synchronism can be 
easily obtained using buffer memories, but for continuous streams of 
data, such methods are useful only for very small discrepancies in 
clock frequencies. A case of particular interest in telephone networks 
is the transmission and processing of pulse code modulation (PCM). 
Changing the word rate of such data can introduce objectionable noise 
into the signal. We describe a circuit that uses digital filters to contain 
this noise. 


II. RESAMPLING


We know that when an analog signal, $x(t)$, having spectral density
$X(\omega)$ and bandwidth $\omega_0$ is sampled at regular intervals, $\tau$, the spectral
density of the sampled signal can be expressed as the sum of images

$$X'(\omega) = \frac{1}{\tau} \sum_{n} X\left(\omega + \frac{2\pi n}{\tau}\right). \tag{1}$$

When $\omega_0 \tau < \pi$, the original signal can be recovered by filtering out the
images for $n \neq 0$.

If we were to change the sampling rate by first holding each sample
value constant throughout its period and resampling at the new period,
$\tau_1 < \pi/\omega_0$, the spectral density of the resampled signal can be expressed
as

$$X''(\omega) = \frac{1}{\tau_1} \sum_{n} \sum_{k} H\left(\omega + \frac{2\pi k}{\tau_1}\right) X\left(\omega + \frac{2\pi n}{\tau} + \frac{2\pi k}{\tau_1}\right), \tag{2}$$

where

$$H(\omega) = \frac{\sin(\omega\tau/2)}{\omega\tau/2} = \operatorname{sinc}(f\tau). \tag{3}$$

When we try to recover the original signal from $X''(\omega)$ by filtering out
the components for $n \neq k \neq 0$, it is contaminated by cross-modulation
products for which

$$0 < \left|\frac{n}{\tau} + \frac{k}{\tau_1}\right| < \frac{\omega_0}{2\pi}. \tag{4}$$

Notice that when $\tau$ and $\tau_1$ are integer multiples of one another, this
condition never holds and reflected noise is absent from the baseband.
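
The integer-multiple observation can be checked numerically. The sketch below assumes the in-band condition reads 0 < |n/τ + k/τ₁| < ω₀/2π, with 1/τ and 1/τ₁ the input and output sampling rates, and searches a finite range of n and k for offending products. The function name `inband_products`, the search bound `max_order`, and the 3.5-kHz band edge in the usage note are illustrative assumptions, not part of the circuit description.

```python
from fractions import Fraction


def inband_products(f_in, f_out, f0, max_order=8):
    """List (n, k, shift) with 0 < |n*f_in + k*f_out| < f0.

    f_in  : original sampling rate, 1/tau
    f_out : new sampling rate, 1/tau_1
    f0    : signal bandwidth, omega_0 / (2*pi)
    All three share one unit (kHz in the examples). Exact rational
    arithmetic avoids rounding at the band edge.
    """
    f_in, f_out, f0 = Fraction(f_in), Fraction(f_out), Fraction(f0)
    hits = []
    for n in range(-max_order, max_order + 1):
        for k in range(-max_order, max_order + 1):
            shift = abs(n * f_in + k * f_out)
            if 0 < shift < f0:
                hits.append((n, k, shift))
    return hits
```

With an 8-kHz input and a 3.5-kHz band edge, `inband_products(8, 32, Fraction(7, 2))` finds nothing because every nonzero shift is a multiple of 8 kHz, whereas `inband_products(8, 10, Fraction(7, 2))` reports, among others, the product n = -1, k = 1, which lands at 2 kHz.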

An obvious means of eliminating in-band cross-modulation products
from eq. (2) is to smooth out the high-frequency components of the
sampled signal x'(t) before resampling it. This has been accomplished
by replacing the sample and hold represented by H(ω) in eq. (3) by a
better low-pass filter that interpolates new sample values from the old
ones. Another implementation of this method is demodulation with
a digital/analog (D/A) converter, analog smoothing, and remodulation
with an analog/digital (A/D) converter. The method presented here
raises the sample rate to a high multiple of the original rate, smooths
the sample values with digital filters, holds the smoothed samples, and
resamples them at the desired rate.


III. A CIRCUIT FOR RESAMPLING 8-kHz PCM


We will describe a circuit for resampling a 3.5-kHz signal that has 
been pulse code modulated at a nominal 8-kHz rate using 16 bits per 
word. The new sampling can be at any rate up to 512 kHz and even 
higher rates can be accommodated by minor modification of the 
circuits. The technique first raises the sampling rate 16 times to 128 
kHz using digital interpolating filters to smooth out all unnecessary 
images of the signal, leaving only those that are adjacent to the new 
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rate and its harmonics. The high-frequency code is placed in a holding 
register from which the output is gated at the desired rate. We will see
that the filter action of this hold, sinc (f/128), provides more than 30-dB
attenuation of the cross-modulation products that fold into baseband.

Because the complexity of digital filters increases sharply with 
increasing sampling rate, it pays to raise the rate in stages.’ Figure 1 
illustrates the process. Our first stage raises the sampling rate four 
times from 8 to 32 kHz, employing a low-pass filter that cuts off sharply 
to attenuate spectral images between 4 and 28 kHz. The second stage 
raises the rate to 128 kHz by simple linear interpolation, and the third 
stage is a holding register. Most of the circuits used originally were 




Fig. 1—An illustration of the signal’s spectrum at various stages of the conversion: 
(a) The original signal. (b) The signal sampled at 8 kHz. (c) The response of the low- 
pass filter. (d) and (e) The output from the low-pass filter. (f) The response of the linear 
interpolator. (g) The output of the interpolator. (h) The response of the holding circuit. 
(i) The held signal. 
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designed to be part of an oversampled coder/decoder (codec).* We 
shall summarize their relevant properties in the following sections. 


IV. THE LOW-PASS FILTER 


The low-pass filter shown in Fig. 2 processes data at 32 kilowords 
per second. Its z-transform response is given by 


[Eq. (5), the z-transform response H(z) of the low-pass filter, is not legible in the source.]


Each word of its 8-kHz input signal is repeated four times and fed into 
two second-order sections. This filter attenuates the images of the 
signal in the range 4 to 28 kHz by more than 34 dB. Its output is a 
good approximation of pulse code modulation at 32 kHz. The spectral 
response of the entire resampler is shown in Fig. 3. The zeros at 4.5 
and 6 kHz are introduced by the second-order sections, and those at 8 
and 16 kHz by the repetition of input words. 
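
The origin of these zeros can be illustrated with a short Python sketch. Repeating each word four times at 32 kHz is equivalent to the filter 1 + z⁻¹ + z⁻² + z⁻³, and a second-order numerator of the form 1 − c z⁻¹ + z⁻² places a zero pair on the unit circle. The coefficient c below is derived from the stated zero frequency purely for illustration; it is an assumption and is not taken from the filter coefficients of the circuit. The function names are likewise hypothetical.

```python
import cmath
import math

FS_KHZ = 32.0  # rate at which the low-pass filter is clocked


def _z_inv(f_khz):
    """z^-1 evaluated on the unit circle at frequency f_khz."""
    return cmath.exp(-2j * math.pi * f_khz / FS_KHZ)


def repeat4_gain(f_khz):
    """Gain of four-fold word repetition, normalized to 1 at dc.

    1 + z^-1 + z^-2 + z^-3 = (1 - z^-4) / (1 - z^-1)
    """
    z1 = _z_inv(f_khz)
    return abs(1 + z1 + z1 ** 2 + z1 ** 3) / 4.0


def notch_gain(f_khz, f_zero_khz):
    """Gain of a generic numerator 1 - c*z^-1 + z^-2.

    c is chosen so that the zero pair lies at f_zero_khz; it is an
    illustrative value, not a coefficient of the actual filter.
    """
    c = 2.0 * math.cos(2.0 * math.pi * f_zero_khz / FS_KHZ)
    z1 = _z_inv(f_khz)
    return abs(1 - c * z1 + z1 ** 2)
```

Evaluating `repeat4_gain` at 8 and 16 kHz, or `notch_gain(f, f)` at 4.5 and 6 kHz, returns values at the level of floating-point rounding, which is the numerical signature of a transmission zero.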
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Fig. 2—An outline of the low-pass filter, clocked at 32 kilowords per second, that is 
used to raise the sampling rate from 8 to 32 kHz. 
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Fig. 3—Spectral response of the resampler. (a) The calculated gain of the cascaded 
filters used in the resampler. (b) The group delay of the filters. 
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V. LINEAR INTERPOLATION 


Simple linear interpolation of three sample values increases the
sampling rate from 32 to 128 kHz. This process has a frequency response
that can be expressed as

$$I(\omega) = \left(\frac{\operatorname{sinc}(f/32)}{\operatorname{sinc}(f/128)}\right)^{2}. \tag{6}$$

Its attenuation of spectral images in the range 28 to 36, 60 to 68, and
92 to 100 kHz exceeds 40 dB. The small amount of droop introduced
into baseband is compensated in the low-pass filter so that the entire
circuit has inband gain in the range −0.41 to −0.57 dB. The circuit
implementation shown as Fig. 4 is based on the results

$$y(n\tau) = y[(n-1)\tau] + \tfrac{1}{4}\,\Delta x(n\tau),$$

$$\Delta x(n\tau) = x(4n\tau) - x[4(n-1)\tau],$$

and

$$y(4n\tau) = x(4n\tau). \tag{7}$$

After each new input sample enters register R1, the output, held in
register R2, increments four times to make its value equal to the input.
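
The register behavior just described can be sketched in a few lines of Python. This is a minimal software model, not the circuit itself: the function name `linear_interpolate_x4` and the `start` argument (the value assumed to be in both registers before the first sample) are illustrative assumptions.

```python
def linear_interpolate_x4(samples, start=0.0):
    """Raise the sample rate by four with an accumulator.

    For every input word the output register steps four times by one
    quarter of the difference between the new and the previous input,
    so the fourth step lands on the new input value.
    """
    out = []
    y = float(start)      # output register (R2 in the text)
    prev = float(start)   # previous input word
    for x in samples:     # each x models a new word entering R1
        step = (x - prev) / 4.0
        for _ in range(4):
            y += step
            out.append(y)
        prev = float(x)
    return out
```

For the input `[4, 8, 0]` the model produces `[1, 2, 3, 4, 5, 6, 7, 8, 6, 4, 2, 0]` as floats, and every fourth output equals the corresponding input word.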


VI. RESAMPLING 


Figure 5 shows the circuit that is used to resample the signal at the 
desired output rate without causing conflict with the internal clock. 
Here the output from the linear interpolator is loaded into register R3 
in time with the internal clock running at 1 MHz. This loading is 
inhibited by the presence of the output clock, which after a short delay 
loads R4 from R3. The frequency response associated with the holding
action of this circuit may be expressed as 


Fig. 4—An outline of the linear interpolator used to raise the sampling rate from 32 
to 128 kHz. 
Fig. 5—The circuit used to sample the output. Register R2 is part of the linear
interpolator; R3 is loaded from R2 at 1 MHz. An output demand inhibits loading of R3,
and then loads R4 from R3.
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Fig. 6—The test circuit. 





$$R(\omega) = \operatorname{sinc}(f/128).$$


Images of the signal in the range 124 to 132 kHz are thereby attenuated 
by more than 30 dB. 
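
A short Python sketch makes the holding action concrete. It evaluates the sinc (f/128) response, with f in kilohertz, and models the output gating as picking the most recent held value at each output instant. The function names, the integer-rate restriction, and the idealized timing (no internal 1-MHz clock or loading delay) are simplifying assumptions. For a 3.5-kHz signal the images adjacent to 128 kHz extend from 124.5 to 131.5 kHz, and those are the frequencies checked in the usage note.

```python
import math


def sinc(x):
    """Normalized sinc, sin(pi*x) / (pi*x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)


def hold_attenuation_db(f_khz, rate_khz=128.0):
    """Attenuation in dB of a hold lasting one period of rate_khz."""
    return -20.0 * math.log10(abs(sinc(f_khz / rate_khz)))


def resample_from_hold(held, rate_in_khz, rate_out_khz, n_out):
    """Gate n_out samples out of a holding register.

    held holds successive register contents at rate_in_khz; output m is
    taken at time m / rate_out_khz and sees the latest loaded value.
    Integer rates keep the index arithmetic exact.
    """
    out = []
    for m in range(n_out):
        idx = (m * rate_in_khz) // rate_out_khz
        out.append(held[min(idx, len(held) - 1)])
    return out
```

`hold_attenuation_db(124.5)` and `hold_attenuation_db(131.5)` both come out slightly above 30 dB, while the droop at 3.5 kHz is a small fraction of a decibel. `resample_from_hold(list(range(8)), 128, 96, 4)` returns `[0, 1, 2, 4]`, showing how a non-integer rate ratio skips held values unevenly.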


VII. TESTING THE CIRCUIT


A version of the circuit was built of standard digital circuit compo- 
nents and tested in the setup shown in Fig. 6. The net gain of the 
circuit and its signal-to-noise ratio were measured as the sampling rate 
was varied continuously from 0 to 256 kHz. For a 1.02-kHz input signal
sampled at 8 kHz, the gain remained constant within ±0.05 dB and the
signal-to-noise ratio always exceeded 40 dB. Input amplitudes were 
varied in the range 0 to —60 dB. At lower amplitudes quantization 
noise inherent in the 16-bit input word was significant. 


VIII. CONCLUDING REMARKS


Experience obtained while designing filters for a version of a codec
enables us to estimate that the circuits described here can be
implemented on about 5 mm² of silicon in a standard technology. The
performance of the circuit is good enough that imperfections
introduced by resampling would be insignificant compared with those
normally obtained from μ-255 encoding of the signal.
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